
arxiv: 2604.14692 · v2 · pith:YB43P2XZnew · submitted 2026-04-16 · 💻 cs.CV

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Pith reviewed 2026-05-19 17:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reasoning, object grounding, progressive reasoning, visual evidence, reinforcement learning, multi-step decision, chain of glimpse, video understanding

The pith

Video reasoning improves when each step anchors explicitly to specific visual objects in the frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that video understanding can be turned into a reliable step-by-step process by forcing every reasoning step to stay tied to concrete visual regions around task-relevant objects. It does this with a controller that searches for and grounds those regions iteratively, trained through reinforcement learning that rewards proper grounding format. A reader would care because most current video models process frames in an object-agnostic way and therefore lose track when objects change appearance or position, producing brittle answers on questions that require tracking particular things over time.

Core claim

Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. It features a search-guided controller, optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions.

What carries the argument

The search-guided controller that iteratively selects and grounds task-relevant visual evidence regions to build incremental reasoning traces.
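
The incremental part of this idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Glimpse` record, the `Proposer` signature, `build_trace`, and `toy_proposer` are all hypothetical names, and the search procedure and reinforcement learning that the paper uses to choose regions are left out. The sketch only shows a trace growing one grounded step at a time, with each new step conditioned on the steps before it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; an assumed convention


@dataclass
class Glimpse:
    """One grounded reasoning step (hypothetical structure)."""
    frame_index: int
    box: Box
    note: str  # short textual observation about the region


# A proposer maps (query, trace so far) to the next glimpse, or None to stop.
Proposer = Callable[[str, List[Glimpse]], Optional[Glimpse]]


def build_trace(query: str, propose: Proposer, max_steps: int = 4) -> List[Glimpse]:
    """Grow a trace step by step until the proposer stops or the budget runs out."""
    trace: List[Glimpse] = []
    for _ in range(max_steps):
        step = propose(query, trace)
        if step is None:
            break
        trace.append(step)
    return trace


def toy_proposer(query: str, trace: List[Glimpse]) -> Optional[Glimpse]:
    """Stand-in for a learned controller: replays a fixed two-step script."""
    script = [
        Glimpse(0, (10, 20, 110, 220), "person holds a cup"),
        Glimpse(12, (40, 30, 140, 230), "person puts the cup down"),
    ]
    return script[len(trace)] if len(trace) < len(script) else None


trace = build_trace("What does the person do with the cup?", toy_proposer)
for step in trace:
    print(step.frame_index, step.box, step.note)
```

In a real system the proposer would be the trained model, and the final answer would be produced from the completed trace.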

If this is right

  • The framework produces consistent accuracy gains on both in-domain and out-of-domain video reasoning benchmarks.
  • Reasoning trajectories become more interpretable because each step is tied to explicit visual regions.
  • Over-reliance on saliency-driven cues is reduced by the progressive object-grounding process.
  • The same controller yields robustness and generalization across diverse video tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same incremental grounding idea could be tested on long-form video question answering to see whether trace length stays manageable.
  • If the built traces are stored, they might serve as human-readable explanations for model outputs on video QA datasets.
  • Applying the controller to single-image reasoning tasks would test whether progressive object focus helps even without temporal change.

Load-bearing premise

Training the controller with reinforcement learning plus a format reward will reliably create object-grounding behavior that lifts compositional reasoning above object-agnostic methods.
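
To make the premise concrete, here is a minimal sketch of what a format reward could look like. The paper's actual output format and reward definition are not given in the material above, so the `<glimpse …>` and `<answer>` tags, the A to E answer options, and the binary scoring are all assumptions made for illustration.

```python
import re

# Hypothetical step format: <glimpse frame="3" box="10,20,110,220">text</glimpse>
GLIMPSE = re.compile(
    r'<glimpse frame="(\d+)" box="(\d+),(\d+),(\d+),(\d+)">(.+?)</glimpse>',
    re.DOTALL,
)
# Hypothetical answer format for multiple-choice questions with options A to E.
ANSWER = re.compile(r"<answer>\s*([A-E])\s*</answer>")


def format_reward(output: str) -> float:
    """Return 1.0 for at least one well-formed glimpse plus exactly one answer, else 0.0."""
    has_glimpse = False
    for match in GLIMPSE.finditer(output):
        x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
        if x1 >= x2 or y1 >= y2:
            return 0.0  # degenerate or inverted box
        has_glimpse = True
    has_one_answer = len(ANSWER.findall(output)) == 1
    return 1.0 if has_glimpse and has_one_answer else 0.0


good = '<glimpse frame="3" box="10,20,110,220">cup on table</glimpse><answer>B</answer>'
print(format_reward(good))
```

A check of this kind verifies structure only. It cannot tell whether a box covers a task-relevant object, which is why the premise depends on the task-level reward doing the rest of the work.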

What would settle it

An ablation that removes the search-guided controller or the format reward and measures accuracy on NExTQA, Video-Holmes, CG-Bench Reasoning, and VRBench would settle it. If accuracy stays at the full model's level, the central claim is falsified; if it falls back to the level of object-agnostic baselines, the claim is supported.

Figures

Figures reproduced from arXiv: 2604.14692 by Bo Cheng, Genbao Xu, Nan Ma, Quanxing Zha, Soujanya Poria, Teng Wang, Wei Rao, Wenyuan Gu, Zhixuan Wu.

Figure 1
Figure 1: Inconsistent reasoning in prior models, improved with Chain-of-Glimpse. (a) Vanilla RL-based and (b) CoT-based models both exhibit insufficient evidence integration and global context oversight, as they tend to rely on superficial, visually prominent cues. Consequently, they fail to capture complex dependencies, leading to inconsistent reasoning (D and C). In contrast, (c) our Chain-of-Glimpse performs progress…
Figure 2
Figure 2: Overview of Chain-of-Glimpse. Chain-of-Glimpse formulates video reasoning as a search-guided, multi-turn object-grounded decision process. Given a video and a query, the model searches over object-grounded reasoning trajectories and optimizes them via reinforcement learning with task-level rewards, enabling accurate reasoning beyond visually salient cues.
Figure 3
Figure 3: Effect of MCTS rollout numbers on Video-Holmes.
Figure 4
Figure 4: Ablation studies on NExTQA and Video-Holmes.

Original abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework for video understanding. It formulates the task as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, using a search-guided controller optimized via reinforcement learning with a format reward to produce reliable reasoning trajectories. The approach is evaluated on in-domain NExTQA and out-of-domain benchmarks including Video-Holmes, CG-Bench Reasoning, and VRBench, reporting consistent performance gains, robustness, and generalization over object-agnostic baselines.

Significance. If the central claim holds, the work offers a promising direction for improving interpretability and compositional reasoning in video understanding by explicitly anchoring steps to visual evidence regions rather than relying on saliency-driven cues. The progressive, search-guided structure could help mitigate object variation over time and support multi-step decision-making in complex video tasks.

major comments (2)
  1. [Abstract and method description] The description of the RL optimization (abstract and method) states that the format reward 'significantly incentivizes grounding capability,' yet provides no explicit mechanism to ensure the grounded regions are task-relevant or causally used in subsequent reasoning steps. This leaves open the possibility that gains arise from output formatting compliance or search exploration rather than functional object grounding, as the reward does not penalize irrelevant regions or verify their contribution to the final decision.
  2. [Experiments and results] The experimental claims of consistent gains and improved generalization rest on unspecified implementation details and lack ablations that isolate the contribution of learned grounding (e.g., replacing controller outputs with random or saliency-based regions while keeping the rest of the pipeline fixed). Without such controls or inspection of whether masking the grounded traces degrades performance, the central claim that object-grounded traces drive the improvements over baselines cannot be fully evaluated.
minor comments (2)
  1. [Abstract] The abstract mentions 'extensive evaluations' and 'consistent performance gains' but does not specify the exact metrics, baseline models, or quantitative improvements; adding these details would improve clarity for readers.
  2. [Method] Notation for the search-guided controller and the format reward could be introduced more formally with equations or pseudocode to make the iterative grounding process easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to the manuscript.

  1. Referee: [Abstract and method description] The description of the RL optimization (abstract and method) states that the format reward 'significantly incentivizes grounding capability,' yet provides no explicit mechanism to ensure the grounded regions are task-relevant or causally used in subsequent reasoning steps. This leaves open the possibility that gains arise from output formatting compliance or search exploration rather than functional object grounding, as the reward does not penalize irrelevant regions or verify their contribution to the final decision.

    Authors: We appreciate this observation on the need for greater clarity. The format reward ensures structural compliance in producing grounded outputs, while the search-guided controller is optimized through reinforcement learning to maximize task performance on the video understanding objective. This process encourages selection of regions that support successful reasoning trajectories, as non-contributory groundings would not aid in reaching correct decisions. We agree the description can be strengthened and will revise the method section to more explicitly describe how the combined RL objective and search guidance promote task-relevant grounding beyond format compliance alone. revision: yes

  2. Referee: [Experiments and results] The experimental claims of consistent gains and improved generalization rest on unspecified implementation details and lack ablations that isolate the contribution of learned grounding (e.g., replacing controller outputs with random or saliency-based regions while keeping the rest of the pipeline fixed). Without such controls or inspection of whether masking the grounded traces degrades performance, the central claim that object-grounded traces drive the improvements over baselines cannot be fully evaluated.

    Authors: We agree that targeted ablations would better isolate the contribution of the learned grounding. In the revised manuscript we will add experiments that substitute the controller outputs with random regions and with saliency-based regions while holding the remainder of the pipeline fixed. We will also report performance when the grounded traces are masked. These additions will provide direct evidence regarding the role of object-grounded traces in the observed gains. revision: yes
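
The substitution control described in the second response could be set up along the following lines. This is a sketch under stated assumptions, not the authors' protocol: the function names, the `(x1, y1, x2, y2)` pixel box convention, and the choice to keep each box's size fixed are all hypothetical. Matching the size keeps the amount of visual evidence constant, so any accuracy drop can be attributed to where the model looks rather than how much it sees.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; an assumed convention


def random_box_like(box: Box, width: int, height: int, rng: random.Random) -> Box:
    """Sample a box of the same size as `box` at a uniformly random position in the frame."""
    w, h = box[2] - box[0], box[3] - box[1]
    x1 = rng.randint(0, width - w)   # randint is inclusive on both ends
    y1 = rng.randint(0, height - h)
    return (x1, y1, x1 + w, y1 + h)


def randomize_trace(boxes: List[Box], width: int, height: int, seed: int = 0) -> List[Box]:
    """Replace every grounded box in a trace with a size-matched random box."""
    rng = random.Random(seed)  # fixed seed so the control run is reproducible
    return [random_box_like(b, width, height, rng) for b in boxes]


learned = [(10, 20, 110, 220), (300, 50, 420, 170)]
control = randomize_trace(learned, width=640, height=360)
print(control)
```

The rest of the pipeline would then be run unchanged on `control` in place of `learned`, and the two accuracies compared.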

Circularity Check

0 steps flagged

No circularity: Chain-of-Glimpse presents an independent RL-based framework without reducing results to inputs by construction.


The paper defines Chain-of-Glimpse as a novel search-guided progressive object-grounded reasoning process for video understanding, optimized via reinforcement learning with a format reward to encourage grounding. No equation, derivation, or claim in the abstract or description reduces a reported prediction or first-principles result to a fitted parameter, a self-citation chain, or a renamed input by construction. The central claims rest on empirical evaluations against external benchmarks rather than on tautological re-derivations, so the argument is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities beyond the high-level description of the RL controller and format reward; these are treated as standard optimization choices rather than new postulates.



