pith. machine review for the scientific record.

arxiv: 2604.04372 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reasoning · retrieval-augmented generation · knowledge graph · large multimodal models · visual fusion · training-free reasoning

The pith

Rendering a minimal knowledge subgraph as one visual frame lets multimodal models fuse external facts directly in video space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Graph-to-Frame RAG as a training-free way to supply missing knowledge during video reasoning with large multimodal models. It builds a problem-agnostic graph offline that links entities, events, spatial relations, and world facts, then uses a controller to select and render only the needed part as a single appended frame. The model therefore reasons over the original video plus this new visual element in one domain, avoiding the attention split that occurs when text or extra clips are added. This produces measurable gains on benchmarks, especially those that demand outside knowledge, while leaving an inspectable visual record of what was used.

Core claim

G2F-RAG constructs a video knowledge graph offline, retrieves a minimal sufficient subgraph on demand via a hierarchical controller, renders that subgraph as a single reasoning frame, and appends the frame to the input video so that large multimodal models can perform joint reasoning entirely inside a unified visual domain while preserving an explicit, auditable evidence trail.
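
Read as control flow, the claim compresses to a short routing loop. The sketch below is a minimal Python rendering of that loop as this review understands it; the helper callables (needs_knowledge, retrieve_subgraph, render_frame) are hypothetical stand-ins for the paper's orchestration, retrieval, and rendering agents, not the paper's API.

```python
# Minimal sketch of the G2F-RAG online stage as described above.
# The helper callables are hypothetical stand-ins for the paper's agents,
# passed in as arguments so the routing logic itself is self-contained.

def answer(video, query, graph, lmm, needs_knowledge, retrieve_subgraph, render_frame):
    """Route a query: answer from V directly, or from V~ = [V; I_RF]."""
    if not needs_knowledge(query, video):
        # Easy case: the LMM answers from the raw video alone.
        return lmm(frames=video, question=query)
    # Hard case: fetch a minimal sufficient subgraph S* for this query,
    # render it as one visual reasoning frame, and append the frame.
    subgraph = retrieve_subgraph(graph, query)
    reasoning_frame = render_frame(subgraph)
    return lmm(frames=video + [reasoning_frame], question=query)
```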

What carries the argument

The rendering step that converts a retrieved minimal subgraph from the video knowledge graph into one appended visual reasoning frame for direct fusion with the video input.
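
Concretely, that rendering step amounts to graph drawing. The paper's reference list includes Graphviz, so a sketch along the following lines is plausible; the triple format, layout choices, and function name here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative rendering step: draw a retrieved subgraph of
# (head, relation, tail) triples as a single PNG "reasoning frame".
# Requires the graphviz Python package and the Graphviz binaries.
import graphviz

def render_subgraph_to_frame(triples, path="reasoning_frame"):
    g = graphviz.Digraph(graph_attr={"rankdir": "LR"})
    for head, relation, tail in triples:
        g.node(head)                         # entities and events become nodes
        g.node(tail)
        g.edge(head, tail, label=relation)   # relations become labeled edges
    # Writes e.g. reasoning_frame.png, the image appended to the video.
    return g.render(path, format="png", cleanup=True)

# Toy subgraph linking an on-screen object to a piece of world knowledge.
frame_path = render_subgraph_to_frame([
    ("person", "holds", "whisk"),
    ("whisk", "used for", "beating eggs"),
])
```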

If this is right

  • Consistent accuracy gains appear across multiple public video reasoning benchmarks.
  • Larger gains occur on tasks that require external world knowledge.
  • The method works as a plug-and-play addition to different large multimodal model backbones without any retraining.
  • Reasoning becomes fully auditable because the exact visual evidence is visible in the input.
  • Ablations show that both the graph representation and the visual delivery format contribute to the observed improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-frame design may extend to other multimodal retrieval settings where cross-modal mixing currently hurts performance.
  • The offline graph plus online controller split could be reused to control knowledge injection in long-form or streaming video tasks.
  • If visual fusion proves robust, similar rendering pipelines might reduce reliance on fine-tuning for knowledge updates.

Load-bearing premise

Appending the rendered visual frame does not create new attention dilution, information loss, or integration errors inside the model.

What would settle it

On knowledge-intensive benchmarks, replacing the rendered frame with an equivalent textual description of the same subgraph would produce equal or lower accuracy than the visual version.
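
Mechanically, that comparison is a simple A/B evaluation. The sketch below shows its shape, assuming a hypothetical benchmark iterable of (video, query, answer) triples and hypothetical lmm, retrieve, render_frame, and verbalize callables; none of these names come from the paper.

```python
# Schematic A/B test: deliver the identical subgraph once as a rendered
# frame (visual arm) and once as text (textual arm), then compare accuracy.
# All callables and the benchmark format are hypothetical.

def compare_delivery(benchmark, lmm, retrieve, render_frame, verbalize):
    hits = {"visual": 0, "textual": 0}
    total = 0
    for video, query, gold in benchmark:
        subgraph = retrieve(query)          # the same S* feeds both arms
        visual = lmm(frames=video + [render_frame(subgraph)], question=query)
        textual = lmm(frames=video,
                      question=verbalize(subgraph) + "\n" + query)
        hits["visual"] += int(visual == gold)
        hits["textual"] += int(textual == gold)
        total += 1
    # The paper's framing predicts visual >= textual accuracy on
    # knowledge-intensive benchmarks.
    return {arm: n / total for arm, n in hits.items()}
```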

Figures

Figures reproduced from arXiv: 2604.04372 by Guijian Tang, Huibin Tan, Nong Xiao, Songyuan Yang, Weijiang Yu, Wenjing Yang, Ziyu Liu.

Figure 1. Traditional Video RAG appends auxiliary text, mixing …

Figure 2. (a) Attention Analysis. Text-append Video-RAG diverts attention from key frames toward contextual tokens, while G2F-RAG …

Figure 3. Method overview. Offline: a graph-construction agent builds a problem-agnostic video knowledge graph with optional external knowledge. Online: an orchestration agent routes the query; a retrieval agent extracts a minimal subgraph S*; a rendering agent converts it into one reasoning frame and appends it to the video to form Ṽ = [V; I_RF]. Finally, the LMM answers from V for easy cases, or from Ṽ for har…

Figure 4. G2F-RAG pipeline. Build a full video graph, retrieve a minimal subgraph for the query, render it as a single frame, and append it to the video for visual-space reasoning with the LMM. … view models participants, actions, intent, preconditions, postconditions, and event-level abstractions connected by causal links. The scene–affordance view models objects and their affordances, functional areas and connectivit…

Figure 5. Case study. The system builds an offline full graph, retrieves a compact subgraph online, renders one reasoning frame, and appends it to the video for reasoning. Further analysis shows attention concentrates on key video frames and the graph frame. The gains for 8B and 14B are slightly smaller but remain steady, especially on knowledge- and structure-intensive tasks. The pattern indicates that the effect comes from…
read the original abstract

When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone. We present Graph-to-Frame RAG (G2F-RAG), a training-free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual-space knowledge fusion for robust and interpretable video reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Graph-to-Frame RAG (G2F-RAG), a training-free paradigm for video reasoning with large multimodal models (LMMs). It involves offline construction of a problem-agnostic video knowledge graph integrating entities, events, spatial relations, and world knowledge. Online, a hierarchical multi-agent controller assesses the need for external knowledge, retrieves a minimal subgraph, and renders it as a single appended visual reasoning frame. This enables unified visual-domain reasoning by LMMs, aiming to mitigate attention dilution from heterogeneous signals and provide an explicit evidence trail. The approach is presented as plug-and-play across backbones, yielding consistent benchmark gains especially in knowledge-intensive cases.

Significance. If the empirical claims hold, this work could meaningfully advance retrieval-augmented video reasoning by reframing knowledge delivery as visual fusion rather than textual or multi-modal appending. Strengths include the training-free design, auditable trail via the rendered frame, and potential for reduced cognitive load. The problem-agnostic KG and hierarchical control are innovative elements that could generalize. However, the significance is tempered by the absence of specific quantitative results in the abstract, which are necessary to substantiate the 'consistent gains' and 'larger improvements' assertions.

major comments (2)
  1. The abstract asserts that G2F-RAG 'yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings' and that 'ablations further confirm that knowledge representation and delivery matter,' but supplies no numerical values, baseline comparisons, standard deviations, or specific benchmark names. This omission makes it impossible to evaluate the practical impact or statistical significance of the reported improvements.
  2. The central assumption that rendering a minimal subgraph as a single visual frame allows LMMs to perform joint reasoning in a unified visual domain without introducing new attention dilution, information loss, or integration errors is load-bearing for the method's validity. The manuscript should include targeted evidence such as attention visualization comparisons or ablations on frame rendering variants to substantiate this.
minor comments (1)
  1. The term 'problem-agnostic video knowledge graph' is used without an immediate clarifying definition or example of its construction; adding a short parenthetical or reference to the relevant methods subsection would aid clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, agreeing where revisions are warranted to improve clarity and evidence.

read point-by-point responses
  1. Referee: The abstract asserts that G2F-RAG 'yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings' and that 'ablations further confirm that knowledge representation and delivery matter,' but supplies no numerical values, baseline comparisons, standard deviations, or specific benchmark names. This omission makes it impossible to evaluate the practical impact or statistical significance of the reported improvements.

    Authors: We agree that the abstract would be strengthened by including concrete numerical results to allow immediate assessment of impact. In the revised manuscript, we have updated the abstract to report key quantitative findings, including average gains (with standard deviations) on specific benchmarks such as Video-MME and NExT-QA, along with baseline comparisons and notes on larger improvements in knowledge-intensive subsets. Full experimental details, tables, and statistical analysis remain in the body of the paper. revision: yes

  2. Referee: The central assumption that rendering a minimal subgraph as a single visual frame allows LMMs to perform joint reasoning in a unified visual domain without introducing new attention dilution, information loss, or integration errors is load-bearing for the method's validity. The manuscript should include targeted evidence such as attention visualization comparisons or ablations on frame rendering variants to substantiate this.

    Authors: We acknowledge that targeted empirical support is needed for this core assumption. We have added a dedicated subsection to the experiments section containing attention visualization comparisons (showing attention maps for baseline vs. G2F-RAG augmented inputs) and ablations on rendering variants, including single-frame vs. multi-frame approaches and visual vs. textual knowledge delivery. These results demonstrate reduced attention dilution and minimal integration errors, directly supporting the validity of the unified visual-domain reasoning design. revision: yes
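
One concrete way to quantify the promised attention comparison is the share of the answer tokens' attention mass that lands on key video tokens under each input condition. The sketch below assumes a transformer LMM that exposes per-layer attention weights (as Hugging Face-style models do with output_attentions=True); the position bookkeeping is an assumption about the tokenizer layout, not the paper's protocol.

```python
# Hypothetical attention-dilution probe: fraction of attention from the
# answer (query) positions that falls on designated key positions, averaged
# over layers and heads. Comparing the value for text-append vs. frame-append
# inputs would quantify the dilution claim.
import torch

def attention_mass(attn_layers, query_positions, key_positions):
    masses = []
    for attn in attn_layers:            # each tensor: (batch, heads, q, k)
        avg = attn.mean(dim=1)[0]       # average over heads, first example
        focused = avg[query_positions][:, key_positions].sum()
        total = avg[query_positions].sum()
        masses.append((focused / total).item())
    return sum(masses) / len(masses)
```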

Circularity Check

0 steps flagged

No significant circularity; purely procedural pipeline

full rationale

The paper describes a training-free, agent-based pipeline for building a video knowledge graph offline and rendering a minimal subgraph as one appended visual frame for LMM reasoning. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the derivation chain. Claims of benchmark gains are empirical and external to any internal reduction; the method does not derive results from quantities defined by the method itself. This matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that visual-space fusion via a single rendered frame integrates external knowledge effectively for LMMs and that the multi-agent system can reliably select minimal sufficient subgraphs.

axioms (1)
  • domain assumption: Large multimodal models can perform joint reasoning over video plus an appended visual frame without significant new attention dilution or integration errors.
    Invoked when claiming unified visual-domain reasoning and reduced cognitive load.
invented entities (1)
  • Problem-agnostic video knowledge graph integrating entities, events, spatial relations, and linked world knowledge (no independent evidence)
    purpose: To organize and enable retrieval of external knowledge for video reasoning
    Introduced as the offline-stage artifact; no independent evidence of its construction or utility is supplied.

pith-pipeline@v0.9.0 · 5565 in / 1309 out tokens · 39441 ms · 2026-05-10T19:41:34.449446+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 38 canonical work pages · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Goldfish: Vision-language understanding of arbitrarily long videos

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, and Mohamed Elhoseiny. Goldfish: Vision-language understanding of arbitrarily long videos. In European Conference on Computer Vision, pages 251–267. Springer, 2024. 1, 3

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 2

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025. 1, 5, 6

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report …

  6. [6]

    Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs

    Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1818–1826, 2024. 2

  7. [7]

    LLaVA-KD: A framework of distilling multimodal large language models

    Yuxuan Cai, Jiangning Zhang, Haoyang He, Xinwei He, Ao Tong, Zhenye Gan, Chengjie Wang, Zhucun Xue, Yong Liu, and Xiang Bai. LLaVA-KD: A framework of distilling multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 239–249, 2025. 2

  8. [8]

    LongVILA: Scaling long-context visual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. LongVILA: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024. 2

  9. [9]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 2, 5, 6

  10. [10]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 2

  11. [11]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024. 2

  12. [12]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. 2

  13. [13]

    Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

    Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025. 2

  14. [14]

    Graphviz—open source graph drawing tools

    John Ellson, Emden Gansner, Lefteris Koutsofios, Stephen C. North, and Gordon Woodhull. Graphviz—open source graph drawing tools. In International Symposium on Graph Drawing, pages 483–484. Springer, 2001. 5

  15. [15]

    MMBench-Video: A long-form multi-shot benchmark for holistic video understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. MMBench-Video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024. 5

  16. [16]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  17. [17]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 3, 5

  18. [18]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1), 2023. 3

  19. [19]

    Retrieval-Augmented Generation with Graphs (GraphRAG)

    Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, et al. Retrieval-augmented generation with graphs (GraphRAG). arXiv preprint arXiv:2501.00309, 2024. 3

  20. [20]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025. 5

  21. [21]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 2, 5, 6

  22. [22]

    VideoGraph: Recognizing minutes-long human activities in videos

    Noureldien Hussein, Efstratios Gavves, and Arnold W. M. Smeulders. VideoGraph: Recognizing minutes-long human activities in videos. arXiv preprint arXiv:1905.05143, 2019. 2

  23. [23]

    VideoRAG: Retrieval-augmented generation over video corpus

    Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. VideoRAG: Retrieval-augmented generation over video corpus. arXiv preprint arXiv:2501.05874, 2025. 3

  24. [24]

    From specific-MLLMs to omni-MLLMs: A survey on MLLMs aligned with multi-modalities

    Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, and Bing Qin. From specific-MLLMs to omni-MLLMs: A survey on MLLMs aligned with multi-modalities. In Findings of the Association for Computational Linguistics: ACL 2025, 2025. 2

  25. [25]

    Learning to purification for unsupervised person re-identification

    Long Lan, Xiao Teng, Jing Zhang, Xiang Zhang, and Dacheng Tao. Learning to purification for unsupervised person re-identification. IEEE Transactions on Image Processing, 32:3338–3353, 2023. 1

  26. [26]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 5

  27. [27]

    VideoChat-Flash: Hierarchical compression for long-context video modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024. 2

  28. [28]

    StructRAG: Boosting knowledge intensive reasoning of LLMs via inference-time hybrid information structurization

    Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, and Yongbin Li. StructRAG: Boosting knowledge intensive reasoning of LLMs via inference-time hybrid information structurization. arXiv preprint arXiv:2410.08815, 2024. 3

  29. [29]

    STAR-R1: Spatial transformation reasoning by reinforcing multimodal LLMs

    Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. STAR-R1: Spatial transformation reasoning by reinforcing multimodal LLMs. arXiv preprint arXiv:2505.15804, 2025.

  30. [30]

    VILA: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 5, 6

  31. [31]

    VCode: A multimodal coding benchmark with SVG as symbolic visual representation

    Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, and Alex Jinpeng Wang. VCode: A multimodal coding benchmark with SVG as symbolic visual representation. arXiv preprint arXiv:2511.02778, 2025. 3

  32. [32]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023. 2

  33. [33]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

  34. [34]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024. 2

  35. [35]

    Lost in the Middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. 2

  36. [36]

    TempCompass: Do video LLMs really understand videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? arXiv preprint arXiv:2403.00476, 2024. 5

  37. [37]

    Video-RAG: Visually-aligned retrieval-augmented long video comprehension

    Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-RAG: Visually-aligned retrieval-augmented long video comprehension. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 3, 5, 6

  38. [38]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    AI Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. 2

  39. [39]

    GPT-4o mini: advancing cost-efficient intelligence

    OpenAI. GPT-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024.

  40. [40]

    Mr. Video: "MapReduce" is the principle for long video understanding

    Ziqi Pang and Yu-Xiong Wang. Mr. Video: "MapReduce" is the principle for long video understanding. arXiv preprint arXiv:2504.16082, 2025. 2

  41. [41]

    Vamba: Understanding hour-long videos with hybrid mamba-transformers

    Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, and Wenhu Chen. Vamba: Understanding hour-long videos with hybrid mamba-transformers. arXiv preprint arXiv:2503.11579, 2025. 2

  42. [42]

    VideoRAG: Retrieval-augmented generation with extreme long-context videos

    Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025. 2, 3

  43. [43]

    Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding

    Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 2, 3, 5, 6

  44. [44]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2, 5, 6

  45. [45]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2

  46. [46]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 1, 2, 5, 6

  47. [47]

    InternVideo2.5: Empowering video MLLMs with long and rich context modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025. 6

  48. [48]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025. 2

  49. [49]

    Number it: Temporal grounding videos like flipping manga

    Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, and Xu Yang. Number it: Temporal grounding videos like flipping manga. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13754–13765, 2025. 3

  50. [50]

    AdaVideoRAG: Omni-contextual adaptive retrieval-augmented efficient long video understanding

    Zhucun Xue, Jiangning Zhang, Xurong Xie, Yong Liu, Xiangtai Li, Dacheng Tao, et al. AdaVideoRAG: Omni-contextual adaptive retrieval-augmented efficient long video understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 3

  51. [51]

    VideoChat-R1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. VideoChat-R1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025. 2

  52. [52]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 2

  53. [53]

    Thinking in Space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1, 5

  54. [54]

    WildVideo: Benchmarking LMMs for understanding video-language interaction

    Songyuan Yang, Weijiang Yu, Wenjing Yang, Xinwang Liu, Huibin Tan, Long Lan, and Nong Xiao. WildVideo: Benchmarking LMMs for understanding video-language interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1, 3, 5

  55. [55]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024. 2, 5, 6

  56. [56]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024. 2

  57. [57]

    MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025. 5, 6

  58. [58]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 2, 5, 6

  59. [59]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 2

  60. [60]

    VTimeCoT: Thinking by drawing for video temporal grounding and reasoning

    Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, and Chao Ma. VTimeCoT: Thinking by drawing for video temporal grounding and reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24203–24213, 2025. 3

  61. [61]

    Deep Video Discovery: Agentic search with tool use for long-form video understanding

    Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Deep video discovery: Agentic search with tool use for long-form video understanding. arXiv preprint arXiv:2505.18079, 2025. 2

  62. [62]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 2, 5, 6

  63. [63]

    TinyLLaVA: A framework of small-scale large multimodal models

    Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. TinyLLaVA: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024. 2

  64. [64]

    MLVU: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025. 3, 5

  65. [65]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 2, 5, 6