{"total":10,"items":[{"citing_arxiv_id":"2604.24339","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection","primary_cat":"cs.CV","submitted_at":"2026-04-27T11:31:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing them through step-by-step reasoning. This paradigm has been successfully extended to VLMs, giving rise to the research framework of Multimodal Chain-of-Thought (MCoT). Early MCoT methods adopted a \"perceive-then-reason\" paradigm, but decoupling vision and language often led to loss of critical visual details. Recent works like LLaVA-CoT [54], Virgo [14], and Mulberry [55] introduced \"Slow Thinking\" mechanisms to improve systematic reasoning, yet they are commonly task-specific and generalize poorly in open-domain settings. GThinker [59] proposed a more flexible, prompt-driven approach with vision-guided reflection to improve cross-task generalization and interpretability. 
Recent MCoT [25, 44, 48] methods have"},{"citing_arxiv_id":"2604.12890","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Long-horizon Agentic Multimodal Search","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:40:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and linguistic plugins, including plugins for object detection [44], image segmentation [45], and OCR [46]. This setup enables MLLMs to autonomously invoke appropriate tools based on complex user instructions. Beyond this basic paradigm, recent studies internalize such interactive capabilities into the model's reasoning process, leading to the thinking-with-image paradigm [47, 48, 49, 50]. Such frameworks treat visual operations as explicit reasoning steps, facilitating significant gains in spatial reasoning and fine-grained VQA. Building upon these advancements, recent work [6, 21, 7, 8] deeply integrates search engines as core tools into the reasoning chain of MLLMs. 
By combining robust internal visual reasoning with dynamic external search tools, models are empowered to perform"},{"citing_arxiv_id":"2601.04442","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization","primary_cat":"cs.CV","submitted_at":"2026-01-07T23:05:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14044","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2025-12-16T03:19:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23322","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework","primary_cat":"cs.CV","submitted_at":"2025-09-27T14:13:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRP decouples 
reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22746","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","primary_cat":"cs.AI","submitted_at":"2025-09-26T04:33:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01805","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpaceR: Reinforcing MLLMs in Video Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2025-04-02T15:12:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.17352","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","primary_cat":"cs.CV","submitted_at":"2025-03-21T17:52:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Iterative SFT-RL 
cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IPVR [51]; Multimodal-CoT [29]; CoT-PT [52]; PromptCoT [53]; VCoT [54]; PCoT [55]; MM-CoT [56]; HoT [57]; CoTDet [58]; DDCoT [59]; CPSeg [60]; Gen2Sim [61]; CoI [62]; MC-CoT [63]; CCoT [64]; LoT [65]; DPMM-CoT [66]; GCoT [23]; CoCoT [67]; KAM-CoT [68]; PKRD-CoT [69]; CoS [70]; CoA [71]; Det-CoT [72]; BDoG [73]; TextCoT [74]; CoRAG [75]; Cantor [76]; Visual Sketchpad [77]; IoT [78]; PS-CoT [79]; G-CoT [80]; STIC [81]; SNSE-CoT [82]; CoE [83]; DCoT [84]; Layoutllm-t2i [85]; Creatilayout [86]; visual-o1 [87]; R-CoT [88]; LLaVA-CoT [9]; VIC [89]; RelationLMM [90]; Insight-V [91]; LLaVA-Aurora [92]; AR-MCTS [93]; Mulberry [94]; Virgo [95]; Socratic [96]; LlamaV-o1 [97]; MVoT [30]; PARM++ [34]; URSA [98]; Multimodal Open R1 [99]; AStar [100]; R1-OneVision [101]; SoT [102] Video CaVIR [103]; VideoAgent [104]; Track-LongVideo [105]; CaRDiff [106]; VoT [31]; R3CoT [107]; Antgpt [108]; Grounding-Prompter [109]; VIP [110]; DreamFactory [111]; Chain-of-Shot [112]; TI-PREGO [113] Audio/Speech SpeechGPT-Gen [114]; CoT-ST [115]; LPE [116]; SpatialSonic [117]; Audio-CoT [118]; 
Audio-Reasoner [119]"},{"citing_arxiv_id":"2503.09567","ref_index":169,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Michael Qizhe Shieh, and Longxu Dou. Efficient process reward model training via active learning. arXiv preprint arXiv:2504.10559, 2025. [168] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. [169] Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. Transactions on Machine Learning Research, July 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=uHLDkQVtyC. [170] Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David"}],"limit":50,"offset":0}