Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Bo Cheng; Genbao Xu; Nan Ma; Quanxing Zha; Soujanya Poria; Teng Wang; Wei Rao; Wenyuan Gu; Zhixuan Wu

arxiv: 2604.14692 · v2 · pith:YB43P2XZ · submitted 2026-04-16 · 💻 cs.CV

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

Pith reviewed 2026-05-19 17:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords video reasoning · object grounding · progressive reasoning · visual evidence · reinforcement learning · multi-step decision · chain of glimpse · video understanding

The pith

Video reasoning improves when each step anchors explicitly to specific visual objects in the frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that video understanding can be turned into a reliable step-by-step process by forcing every reasoning step to stay tied to concrete visual regions around task-relevant objects. It does this with a controller that searches for and grounds those regions iteratively, trained through reinforcement learning that rewards proper grounding format. A reader would care because most current video models process frames in an object-agnostic way and therefore lose track when objects change appearance or position, producing brittle answers on questions that require tracking particular things over time.

Core claim

Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. It features a search-guided controller, optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions.
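To make the format-reward idea concrete, here is a minimal sketch of a reward that checks output structure only. The tag names (`<glimpse>`, `<answer>`), the box syntax, and the function `format_reward` are assumptions made for illustration; the abstract does not specify the paper's actual output format or reward definition.

```python
import re

# Hypothetical output format (an assumption, not taken from the paper):
#   <glimpse>frame=12 box=[0.10,0.20,0.50,0.60]</glimpse> ... <answer>B</answer>
NUM = r"(\d+(?:\.\d+)?)"
GLIMPSE_RE = re.compile(
    r"<glimpse>frame=(\d+) box=\[" + ",".join([NUM] * 4) + r"\]</glimpse>"
)
ANSWER_RE = re.compile(r"<answer>([A-E])</answer>\s*$")


def format_reward(output: str) -> float:
    """Return 1.0 if the output has at least one well-formed glimpse and a final answer."""
    glimpses = GLIMPSE_RE.findall(output)
    if not glimpses or ANSWER_RE.search(output) is None:
        return 0.0
    for _frame, x1, y1, x2, y2 in glimpses:
        x1, y1, x2, y2 = map(float, (x1, y1, x2, y2))
        # Boxes must use normalised coordinates and have positive area.
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            return 0.0
    return 1.0
```

A reward of this shape gives a well-formed box on an irrelevant region the same score as a box on the relevant object, which is the gap the editorial analysis on this page raises.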

What carries the argument

The search-guided controller that iteratively selects and grounds task-relevant visual evidence regions to build incremental reasoning traces.

If this is right

The framework produces consistent accuracy gains on both in-domain and out-of-domain video reasoning benchmarks.
Reasoning trajectories become more interpretable because each step is tied to explicit visual regions.
Over-reliance on saliency-driven cues is reduced by the progressive object-grounding process.
The same controller yields robustness and generalization across diverse video tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same incremental grounding idea could be tested on long-form video question answering to see whether trace length stays manageable.
If the built traces are stored, they might serve as human-readable explanations for model outputs on video QA datasets.
Applying the controller to single-image reasoning tasks would test whether progressive object focus helps even without temporal change.

Load-bearing premise

Training the controller with reinforcement learning plus a format reward will reliably create object-grounding behavior that lifts compositional reasoning above object-agnostic methods.

What would settle it

An ablation that removes the search-guided controller or the format reward would settle it. If accuracy on NExTQA, Video-Holmes, CG-Bench Reasoning, or VRBench falls back to the level of object-agnostic baselines, the central claim is supported; if the gains persist without these components, the claim is falsified.

Figures

Figures reproduced from arXiv: 2604.14692 by Bo Cheng, Genbao Xu, Nan Ma, Quanxing Zha, Soujanya Poria, Teng Wang, Wei Rao, Wenyuan Gu, Zhixuan Wu.

**Figure 1.** Inconsistent reasoning in prior models and its improvement with Chain-of-Glimpse. (a) Vanilla RL-based and (b) CoT-based models both suffer from insufficient evidence integration and global context oversight, as they tend to rely on superficial, visually prominent cues. Consequently, they fail to capture complex dependencies, leading to inconsistent reasoning (D and C). In contrast, (c) our Chain-of-Glimpse performs progress…

**Figure 2.** Overview of Chain-of-Glimpse. Chain-of-Glimpse formulates video reasoning as a search-guided, multi-turn object-grounded decision process. Given a video and a query, the model searches over object-grounded reasoning trajectories and optimizes them via reinforcement learning with task-level rewards, enabling accurate reasoning beyond visually salient cues. Intermediate reasoning states help bridge low-level…

**Figure 3.** Effect of MCTS rollout numbers on Video-Holmes.

**Figure 4.** Ablation studies on NExTQA and Video-Holmes.

Original abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Chain-of-Glimpse uses RL to train a controller for progressive object grounding in video reasoning, showing benchmark gains, but the evidence that grounding drives those gains is not yet convincing.

The main thing to know about this paper is that Chain-of-Glimpse uses a search-guided controller trained with RL and a format reward to progressively ground reasoning steps to specific objects in videos. This is meant to improve compositional reasoning over standard object-agnostic approaches, with reported gains on NExTQA and out-of-domain sets. What is new here is the framing of video reasoning as building spatially grounded traces step by step, with the controller optimized to do the grounding iteratively. The work does a solid job running evaluations across in-domain and several out-of-domain benchmarks, which gives some evidence of robustness and generalization.

The soft spot is around whether the grounding is actually doing the heavy lifting. The format reward pushes the model to output grounding annotations, but it does not explicitly make sure those are relevant or that the reasoning depends on them. This leaves room for the gains to come from the search exploration or just the structured output format rather than true object-grounded reasoning. Without ablations that swap in random or fixed regions and check if performance holds, or inspect if masking the traces drops results, it's difficult to confirm the central mechanism. The abstract does not provide those details, so the soundness depends on what is in the full text.

The approach looks like a reasonable extension of existing chain-of-thought and grounding ideas, with no major circularity or fitting issues apparent. This is for folks in video understanding and multimodal reasoning who care about interpretability and step-by-step processes. A reader working on RL for vision-language tasks would probably find the controller setup interesting.

It deserves peer review because the results are there and the idea is practical, though it would benefit from tighter experiments on the grounding contribution. I would recommend sending it to referees with a note to verify the causal role of the grounded traces.

Referee Report

2 major / 2 minor

Summary. The paper introduces Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework for video understanding. It formulates the task as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, using a search-guided controller optimized via reinforcement learning with a format reward to produce reliable reasoning trajectories. The approach is evaluated on in-domain NExTQA and out-of-domain benchmarks including Video-Holmes, CG-Bench Reasoning, and VRBench, reporting consistent performance gains, robustness, and generalization over object-agnostic baselines.

Significance. If the central claim holds, the work offers a promising direction for improving interpretability and compositional reasoning in video understanding by explicitly anchoring steps to visual evidence regions rather than relying on saliency-driven cues. The progressive, search-guided structure could help mitigate object variation over time and support multi-step decision-making in complex video tasks.

major comments (2)

[Abstract and method description] The description of the RL optimization (abstract and method) states that the format reward 'significantly incentivizes grounding capability,' yet provides no explicit mechanism to ensure the grounded regions are task-relevant or causally used in subsequent reasoning steps. This leaves open the possibility that gains arise from output formatting compliance or search exploration rather than functional object grounding, as the reward does not penalize irrelevant regions or verify their contribution to the final decision.
[Experiments and results] The experimental claims of consistent gains and improved generalization rest on unspecified implementation details and lack ablations that isolate the contribution of learned grounding (e.g., replacing controller outputs with random or saliency-based regions while keeping the rest of the pipeline fixed). Without such controls or inspection of whether masking the grounded traces degrades performance, the central claim that object-grounded traces drive the improvements over baselines cannot be fully evaluated.
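The region-substitution control requested in the second major comment can be sketched as follows. This is an illustrative harness under stated assumptions: `Box`, `randomize_trace`, `ablation_gap`, and the `answer_fn` callable (a stand-in for whatever part of the pipeline consumes the grounded trace) are hypothetical names, and nothing here reflects the paper's actual code.

```python
import random
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalised (x1, y1, x2, y2); an assumed format


def random_box(rng: random.Random, min_size: float = 0.05) -> Box:
    """Draw a box with positive area inside the unit square."""
    x1 = rng.uniform(0.0, 1.0 - min_size)
    y1 = rng.uniform(0.0, 1.0 - min_size)
    x2 = rng.uniform(x1 + min_size, 1.0)
    y2 = rng.uniform(y1 + min_size, 1.0)
    return (x1, y1, x2, y2)


def randomize_trace(trace: Sequence[Box], seed: int = 0) -> List[Box]:
    """Replace every controller box with a random one, keeping the trace length."""
    rng = random.Random(seed)
    return [random_box(rng) for _ in trace]


def ablation_gap(
    answer_fn: Callable[[Sequence[Box]], str],
    examples: Sequence[Tuple[Sequence[Box], str]],
    seed: int = 0,
) -> float:
    """Accuracy with controller boxes minus accuracy with random boxes."""
    n = len(examples)
    grounded = sum(answer_fn(t) == y for t, y in examples) / n
    shuffled = sum(
        answer_fn(randomize_trace(t, seed + i)) == y
        for i, (t, y) in enumerate(examples)
    ) / n
    return grounded - shuffled
```

A gap near zero would suggest the downstream reasoning does not depend on where the boxes are; a large positive gap would support the claim that learned grounding drives the gains.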

minor comments (2)

[Abstract] The abstract mentions 'extensive evaluations' and 'consistent performance gains' but does not specify the exact metrics, baseline models, or quantitative improvements; adding these details would improve clarity for readers.
[Method] Notation for the search-guided controller and the format reward could be introduced more formally with equations or pseudocode to make the iterative grounding process easier to follow.
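As an illustration of the kind of pseudocode the second minor comment asks for, the iterative grounding process can be sketched as a loop in which a controller emits one grounded step per turn until it commits to an answer. This is not the authors' algorithm: the `Step` fields, the `Controller` signature, and the `max_steps` cap are assumptions, and the search over trajectories described in the paper is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # normalised (x1, y1, x2, y2); an assumed format


@dataclass
class Step:
    frame: int                    # index of the frame being inspected
    box: Box                      # region the step is anchored to
    thought: str                  # free-text reasoning for this step
    answer: Optional[str] = None  # set only when the controller decides to stop


# Hypothetical controller interface: (query, trace so far) -> next step.
Controller = Callable[[str, List[Step]], Step]


def run_chain(query: str, controller: Controller, max_steps: int = 8) -> List[Step]:
    """Build a grounded trace one step at a time until an answer is produced."""
    trace: List[Step] = []
    for _ in range(max_steps):
        step = controller(query, trace)
        trace.append(step)
        if step.answer is not None:
            break
    return trace
```

Keeping the trace as an explicit list of (frame, box, thought) records is what would make each decision inspectable after the fact.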

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract and method description] The description of the RL optimization (abstract and method) states that the format reward 'significantly incentivizes grounding capability,' yet provides no explicit mechanism to ensure the grounded regions are task-relevant or causally used in subsequent reasoning steps. This leaves open the possibility that gains arise from output formatting compliance or search exploration rather than functional object grounding, as the reward does not penalize irrelevant regions or verify their contribution to the final decision.

Authors: We appreciate this observation on the need for greater clarity. The format reward ensures structural compliance in producing grounded outputs, while the search-guided controller is optimized through reinforcement learning to maximize task performance on the video understanding objective. This process encourages selection of regions that support successful reasoning trajectories, as non-contributory groundings would not aid in reaching correct decisions. We agree the description can be strengthened and will revise the method section to more explicitly describe how the combined RL objective and search guidance promote task-relevant grounding beyond format compliance alone.
revision: yes
Referee: [Experiments and results] The experimental claims of consistent gains and improved generalization rest on unspecified implementation details and lack ablations that isolate the contribution of learned grounding (e.g., replacing controller outputs with random or saliency-based regions while keeping the rest of the pipeline fixed). Without such controls or inspection of whether masking the grounded traces degrades performance, the central claim that object-grounded traces drive the improvements over baselines cannot be fully evaluated.

Authors: We agree that targeted ablations would better isolate the contribution of the learned grounding. In the revised manuscript we will add experiments that substitute the controller outputs with random regions and with saliency-based regions while holding the remainder of the pipeline fixed. We will also report performance when the grounded traces are masked. These additions will provide direct evidence regarding the role of object-grounded traces in the observed gains.
revision: yes
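The interaction between a task reward and a format reward, which the first exchange turns on, can be illustrated with a small numeric sketch. It assumes a weighted sum of the two rewards (the weight `format_weight` is hypothetical) and uses the group-relative normalisation commonly associated with GRPO, which a paper passage quoted elsewhere on this page mentions. Population standard deviation is used here; implementations differ on that detail.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def total_reward(correct: bool, well_formed: bool, format_weight: float = 0.5) -> float:
    """Task reward plus a weighted format bonus (the weighting is an assumption)."""
    return float(correct) + format_weight * float(well_formed)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled trajectories for one question: all answer incorrectly,
# but only the first follows the grounding format.
group = [
    total_reward(correct=False, well_formed=True),
    total_reward(correct=False, well_formed=False),
    total_reward(correct=False, well_formed=False),
    total_reward(correct=False, well_formed=False),
]
advantages = group_relative_advantages(group)
```

In this toy group the well-formed but incorrect trajectory receives the only positive advantage, so under these assumptions format compliance alone is reinforced. That is the scenario the referee's first comment is concerned with.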

Circularity Check

0 steps flagged

No circularity: Chain-of-Glimpse presents an independent RL-based framework without reducing results to inputs by construction.

The paper defines Chain-of-Glimpse as a novel search-guided progressive object-grounded reasoning process for video understanding, optimized via reinforcement learning with a format reward to encourage grounding. No equations, derivations, or claims in the abstract or description reduce a reported prediction or first-principles result to a fitted parameter, self-citation chain, or renamed input by construction. The central claims rest on empirical evaluations against external benchmarks rather than on tautological re-derivations, so no step of the argument depends on its own output.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities beyond the high-level description of the RL controller and format reward; these are treated as standard optimization choices rather than new postulates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "multi-turn decision policy learning... group relative policy optimization (GRPO)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


G. Chen, Y . Liu, Y . Huang, B. Pei, J. Xu, Y . He, T. Lu, Y . Wang, and L. Wang, “Cg-bench: Clue-grounded question answering benchmark for long video understanding,” inThe Thirteenth International Conference on Learning Representations (ICLR), 2025

work page 2025
[53]

Vrbench: A benchmark for multi-step reasoning in long nar- rative videos.arXiv preprint arXiv:2506.10857, 2025

J. Yu, Y . Wu, M. Chu, Z. Ren, Z. Huang, P. Chu, R. Zhang, Y . He, Q. Li, S. Liet al., “Vrbench: A benchmark for multi-step reasoning in long narrative videos,”arXiv preprint arXiv:2506.10857, 2025

work page arXiv 2025
[54]

Egoschema: A diagnostic benchmark for very long-form video language understanding,

K. Mangalam, R. Akshulakov, and J. Malik, “Egoschema: A diagnostic benchmark for very long-form video language understanding,”Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 46 212–46 244, 2023

work page 2023
[55]

Star: A benchmark for situated reasoning in real-world videos.arXiv preprint arXiv:2405.09711, 2024

B. Wu, S. Yu, Z. Chen, J. B. Tenenbaum, and C. Gan, “Star: A benchmark for situated reasoning in real-world videos,”arXiv preprint arXiv:2405.09711, 2024

work page arXiv 2024
[56]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” arXiv preprint arXiv:2403.13372, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
