{"total":17,"items":[{"citing_arxiv_id":"2605.21988","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:38:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19852","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T13:44:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16416","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-13T16:50:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAVE is a GRPO-based process-reward method that improves VLMs on fragmented visual reasoning by crediting intermediate actions via belief update, evidence acquisition, and adaptive focus, shown on TRACER-Bench and public benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02730","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perceptual Flow Network for Visually Grounded Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-04T15:31:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Processing Systems, 37:95095-95169, 2024. [54] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907-7915, 2025. [55] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025. [56] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms."},{"citing_arxiv_id":"2605.00814","ref_index":79,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs","primary_cat":"cs.CV","submitted_at":"2026-05-01T17:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Transformer layers aligning with the configuration in DeepStack [ 57], introducing only 27.92M additional trainable parameters (a negligible ∼0.32% of the 8B model size). Our training data comprises two subsets: a supervised fine-tuning set Dsft of 526k samples from OpenMMReasoner- SFT [94], and a reinforcement learning set Drl of 3.6k complex reasoning queries aggregated from MMK12 [56], ThinkLite-VL-hard [79], ViRL39K [73], and We-Math2.0-Pro [59]. Training Details.Our pipeline contains two stages:Stage I: Visual Memory Alignment (SFT). We freeze the backbone and exclusively optimize the PVM modules and gating scalars to establish the semantic mapping between textual queries and visual keys.Stage II: Policy Refinement (GRPO). Using Group Relative Policy Optimization [61], we unfreeze the LLM backbone and PVM modules"},{"citing_arxiv_id":"2604.22498","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-24T12:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Furthermore, the learned capabilities transfer effectively to broader multimodal reasoning tasks, yielding con- sistent gains on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69). 2 Related Work 2.1 Visual Grounding in MLLMs Recent Multimodal Large Language Models (MLLMs), such as Shikra [9], Griffon [53], and Ferret [48], formulate visual grounding as a lan- guage generation problem by serializing spatial coordinates into text tokens. This line of work establishes grounding as a practical interface between language generation and object localization. To further improve grounding fidelity and reduce hallucination, subse- quent approaches incorporate stronger supervision or optimization"},{"citing_arxiv_id":"2604.21268","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding","primary_cat":"cs.LG","submitted_at":"2026-04-23T04:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20705","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-22T15:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"5-VL-3B/7B-Instruct [4] as the base MLLMs for all experiments, due to their strong rea- soning performances and moderate model sizes. Apart from the base models, we also compare SSL-R1 with several rep- resentative Qwen2.5-VL-based reasoning models that have undergone reasoning-intensive RL post-training in a super- vised way, including ThinkLite-VL [66], VL-Cogito [79], LLaV A-Critic-R1 [64]. In addition, to further demonstrate the superiority of our method, we consider a concurrent self-supervised RL post-training method,e.g., Visual Jig- saw [70], for comparison, with the results reproduced using their official code. Training Details.We optimize the base models using the GRPO algorithm. During GRPO training, we remove both"},{"citing_arxiv_id":"2604.21718","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Building a Precise Video Language with Human-AI Oversight","primary_cat":"cs.CV","submitted_at":"2026-04-22T09:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 8 [64] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in lan- guage models.arXiv preprint arXiv:2203.11171, 2022. 45 [65] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Lin- jie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025. 3 [66] Yubo Wang, Xiang Yue, and Wenhu Chen. Critique fine- tuning: Learning to critique is more effective than learning to"},{"citing_arxiv_id":"2604.19264","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents","primary_cat":"cs.CV","submitted_at":"2026-04-21T09:28:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10219","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-11T13:59:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"V-STAR achieves state-of-the-art performance on most datasets, outperforming both strong base models and other reasoning-specialized models. Task Benchmark Paper/Conf.Qwen2.5VL [74]R1-Onevision [75]Vision-R1 [76]VL-Rethinker [77]VL-Cogito [78]OpenVLThinker [79]ThinkLite-VL [80]V-STAR Data Size -- --- -- 155k 210k 39k 80k 59.2k 11k 40k General Reasoning & Understanding V-Star [81] CVPR 2024 70.1 66.5 78.9 67.6 79.6 68.1 82.1 81.3 RealWorldQA -- 68.8 60.5 64.3 69.3 68.1 62.3 70.1 72.6 MMVP [82] CVPR 2024 47.3 43.0 44.0 42.0 40.0 46.5 46.7 47.1 MMEval-Pro [83] NAACL 2025 70.6 69.4 72.2 73.2 73 71.5 72.0 73.6 VMCBench [84] NeurIPS 2025 79.7 65.2 80.3 73.9 73.2 80.3 81.4 81.5 MMVet [85] ICML 2024 66.0 72.1 65.6 74.6 73.2 66.9 67."},{"citing_arxiv_id":"2604.08545","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"And all experiments were performed on a server featuring 8 NVIDIA Blackwell B200 GPUs. Baselines.We compareMetisagainst three categories of strong baselines: (1)Open-source models without tool use, including LLaV A-OneVision [ 14], InternVL3-8B [ 57], Qwen2.5-VL- 7B/32B-Instruct [3], and Qwen3-VL-8B-Instruct [ 2]; (2)Text-only reasoning models, including MM-Eureka [20], ThinkLite-VL [38], VL-Rethinker [33], and VLAA-Thinker [4]; and (3)Agentic multimodal models, including Pixel-Reasoner [34], DeepEyes [56], Thyme [51], DeepEyesV2 [8], Mini-o3 [12], and Skywork-R1V4-30B-A3B [53]. Benchmarks.We evaluateMetisacross two broad groups of benchmarks covering complementary cognitive capabilities.Perception and Document Understanding:V*Bench [ 42], HRBench-"},{"citing_arxiv_id":"2604.03179","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models","primary_cat":"cs.LG","submitted_at":"2026-04-03T16:56:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"suring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Sys- tems, 37:95095-95169, 2024. 2, 4 [32] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 2 [33] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Lin- jie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Li- juan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025. 2, 3 [34] Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, and Lichao Sun."},{"citing_arxiv_id":"2603.14184","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-03-15T02:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.05271","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepEyesV2: Toward Agentic Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-07T14:31:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15436","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs","primary_cat":"cs.CV","submitted_at":"2025-05-21T12:18:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.17352","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","primary_cat":"cs.CV","submitted_at":"2025-03-21T17:52:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}