Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Pith reviewed 2026-05-19 17:30 UTC · model grok-4.3
The pith
Video reasoning improves when each step anchors explicitly to specific visual objects in the frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. It features a search-guided controller, optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions.
What carries the argument
The search-guided controller that iteratively selects and grounds task-relevant visual evidence regions to build incremental reasoning traces.
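The iterative grounding described here can be pictured as a loop that grows a trace one grounded step at a time. The following is a minimal sketch, not the paper's implementation: the `Glimpse` record, the `Controller` interface, `build_trace`, and `toy_controller` are all hypothetical names invented for illustration, and the search procedure itself is abstracted away behind the controller callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass(frozen=True)
class Glimpse:
    """One reasoning step tied to an explicit visual region (hypothetical record)."""
    frame_index: int
    box: Box
    thought: str


# Hypothetical controller interface: given the trace so far, return the next
# grounded step, or None once it decides it has gathered enough evidence.
Controller = Callable[[List[Glimpse]], Optional[Glimpse]]


def build_trace(controller: Controller, max_steps: int = 8) -> List[Glimpse]:
    """Grow a grounded reasoning trace one glimpse at a time."""
    trace: List[Glimpse] = []
    for _ in range(max_steps):
        step = controller(trace)
        if step is None:  # controller signals that it is done
            break
        trace.append(step)
    return trace


def toy_controller(trace: List[Glimpse]) -> Optional[Glimpse]:
    """Stand-in policy that emits two fixed glimpses, then stops."""
    script = [
        Glimpse(0, (10, 20, 50, 80), "locate the person holding the cup"),
        Glimpse(12, (30, 25, 70, 90), "check where the cup ends up"),
    ]
    return script[len(trace)] if len(trace) < len(script) else None
```

Calling `build_trace(toy_controller)` returns the two scripted glimpses in order. In a real system the controller would be the learned policy, and each glimpse would condition the next reasoning step.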
If this is right
- The framework produces consistent accuracy gains on both in-domain and out-of-domain video reasoning benchmarks.
- Reasoning trajectories become more interpretable because each step is tied to explicit visual regions.
- Over-reliance on saliency-driven cues is reduced by the progressive object-grounding process.
- The same controller yields robustness and generalization across diverse video tasks.
Where Pith is reading between the lines
- The same incremental grounding idea could be tested on long-form video question answering to see whether trace length stays manageable.
- If the built traces are stored, they might serve as human-readable explanations for model outputs on video QA datasets.
- Applying the controller to single-image reasoning tasks would test whether progressive object focus helps even without temporal change.
Load-bearing premise
Training the controller with reinforcement learning plus a format reward will reliably create object-grounding behavior that lifts compositional reasoning above object-agnostic methods.
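To make this premise concrete, here is a rough sketch of what a format reward combined with an accuracy reward might look like. It also shows the group-relative normalization used by GRPO, which the paper passage quoted further down this page mentions. Everything specific is an assumption made for illustration: the `<glimpse ...>` and `<answer>` tag format, the answer letters A to E, the `format_weight` of 0.5, and all function names. None of these come from the paper.

```python
import re
import statistics
from typing import List

# Hypothetical output format, invented for this sketch:
#   <glimpse frame="3" box="10,20,50,80">reasoning text</glimpse><answer>B</answer>
GLIMPSE_RE = re.compile(
    r'<glimpse frame="(\d+)" box="(\d+),(\d+),(\d+),(\d+)">[^<]+</glimpse>'
)
ANSWER_RE = re.compile(r"<answer>([A-E])</answer>\s*$")


def format_reward(output: str) -> float:
    """1.0 if the output ends with an answer tag and contains at least one
    well-formed glimpse whose box is non-degenerate; otherwise 0.0."""
    if ANSWER_RE.search(output) is None:
        return 0.0
    glimpses = GLIMPSE_RE.findall(output)
    if not glimpses:
        return 0.0
    for _frame, x1, y1, x2, y2 in glimpses:
        if not (int(x1) < int(x2) and int(y1) < int(y2)):
            return 0.0
    return 1.0


def total_reward(output: str, correct_answer: str, format_weight: float = 0.5) -> float:
    """Accuracy reward plus a weighted format reward (the weight is an assumption)."""
    match = ANSWER_RE.search(output)
    accuracy = 1.0 if match and match.group(1) == correct_answer else 0.0
    return accuracy + format_weight * format_reward(output)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that `format_reward` only checks structure: a syntactically valid box over an irrelevant region scores the same as a relevant one. Under this sketch, any pressure toward task-relevant grounding has to come from the accuracy term.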
What would settle it
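An ablation that removes the search-guided controller or the format reward and re-measures accuracy on NExTQA, Video-Holmes, CG-Bench Reasoning, and VRBench. If accuracy holds up without those components, the gains cannot be attributed to object grounding and the central claim is falsified; if it falls back to the level of object-agnostic baselines, the claim survives.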
Original abstract
Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework for video understanding. It formulates the task as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, using a search-guided controller optimized via reinforcement learning with a format reward to produce reliable reasoning trajectories. The approach is evaluated on in-domain NExTQA and out-of-domain benchmarks including Video-Holmes, CG-Bench Reasoning, and VRBench, reporting consistent performance gains, robustness, and generalization over object-agnostic baselines.
Significance. If the central claim holds, the work offers a promising direction for improving interpretability and compositional reasoning in video understanding by explicitly anchoring steps to visual evidence regions rather than relying on saliency-driven cues. The progressive, search-guided structure could help mitigate object variation over time and support multi-step decision-making in complex video tasks.
major comments (2)
- [Abstract and method description] The description of the RL optimization (abstract and method) states that the format reward 'significantly incentivizes grounding capability,' yet provides no explicit mechanism to ensure the grounded regions are task-relevant or causally used in subsequent reasoning steps. This leaves open the possibility that gains arise from output formatting compliance or search exploration rather than functional object grounding, as the reward does not penalize irrelevant regions or verify their contribution to the final decision.
- [Experiments and results] The experimental claims of consistent gains and improved generalization rest on unspecified implementation details and lack ablations that isolate the contribution of learned grounding (e.g., replacing controller outputs with random or saliency-based regions while keeping the rest of the pipeline fixed). Without such controls or inspection of whether masking the grounded traces degrades performance, the central claim that object-grounded traces drive the improvements over baselines cannot be fully evaluated.
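The region-substitution control requested in the second major comment could be set up along the following lines. This is a hedged sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` pixel tuples, the control keeps each box's size and only randomizes its position so that the amount of visible evidence stays comparable, and the names `random_box_like` and `randomize_trace` are hypothetical. A saliency-based control would replace the sampling step with a saliency model, which is not shown here.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def random_box_like(box: Box, frame_w: int, frame_h: int, rng: random.Random) -> Box:
    """Sample a box with the same width and height at a uniformly random position."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w > frame_w or h > frame_h:
        raise ValueError("box does not fit inside the frame")
    new_x = rng.randint(0, frame_w - w)  # randint is inclusive on both ends
    new_y = rng.randint(0, frame_h - h)
    return (new_x, new_y, new_x + w, new_y + h)


def randomize_trace(boxes: List[Box], frame_w: int, frame_h: int, seed: int = 0) -> List[Box]:
    """Replace every grounded box in a trace with a size-matched random box."""
    rng = random.Random(seed)  # seeded so the control condition is reproducible
    return [random_box_like(b, frame_w, frame_h, rng) for b in boxes]
```

Feeding the randomized boxes through the otherwise unchanged pipeline and comparing accuracy against the learned boxes would indicate how much of the gain depends on where the model looks rather than on the output format alone.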
minor comments (2)
- [Abstract] The abstract mentions 'extensive evaluations' and 'consistent performance gains' but does not specify the exact metrics, baseline models, or quantitative improvements; adding these details would improve clarity for readers.
- [Method] Notation for the search-guided controller and the format reward could be introduced more formally with equations or pseudocode to make the iterative grounding process easier to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to the manuscript.
- Referee: [Abstract and method description] The description of the RL optimization (abstract and method) states that the format reward 'significantly incentivizes grounding capability,' yet provides no explicit mechanism to ensure the grounded regions are task-relevant or causally used in subsequent reasoning steps. This leaves open the possibility that gains arise from output formatting compliance or search exploration rather than functional object grounding, as the reward does not penalize irrelevant regions or verify their contribution to the final decision.
  Authors: We appreciate this observation on the need for greater clarity. The format reward ensures structural compliance in producing grounded outputs, while the search-guided controller is optimized through reinforcement learning to maximize task performance on the video understanding objective. This process encourages selection of regions that support successful reasoning trajectories, as non-contributory groundings would not aid in reaching correct decisions. We agree the description can be strengthened and will revise the method section to more explicitly describe how the combined RL objective and search guidance promote task-relevant grounding beyond format compliance alone.
  Revision: yes
- Referee: [Experiments and results] The experimental claims of consistent gains and improved generalization rest on unspecified implementation details and lack ablations that isolate the contribution of learned grounding (e.g., replacing controller outputs with random or saliency-based regions while keeping the rest of the pipeline fixed). Without such controls or inspection of whether masking the grounded traces degrades performance, the central claim that object-grounded traces drive the improvements over baselines cannot be fully evaluated.
  Authors: We agree that targeted ablations would better isolate the contribution of the learned grounding. In the revised manuscript we will add experiments that substitute the controller outputs with random regions and with saliency-based regions while holding the remainder of the pipeline fixed. We will also report performance when the grounded traces are masked. These additions will provide direct evidence regarding the role of object-grounded traces in the observed gains.
  Revision: yes
Circularity Check
No circularity: Chain-of-Glimpse presents an independent RL-based framework without reducing results to inputs by construction.
The paper defines Chain-of-Glimpse as a novel search-guided progressive object-grounded reasoning process for video understanding, optimized via reinforcement learning with a format reward to encourage grounding. No equation, derivation, or claim in the abstract or description reduces a reported prediction or first-principles result to a fitted parameter, a self-citation chain, or a renamed input by construction. The central claims rest on empirical evaluation against external benchmarks rather than on tautological re-derivation, so the argument is not circular.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear
  Paper passage: "search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · relation to the paper passage: unclear
  Paper passage: "multi-turn decision policy learning... group relative policy optimization (GRPO)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.