StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.
Visual thoughts: A unified perspective of understanding multimodal chain-of-thought.arXiv preprint arXiv:2505.15510
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 9roles
background 3polarities
background 3representative citing papers
SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoning benchmarks.
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.
Locate-Then-Examine improves AI-generated image detection by localizing suspicious regions first then performing region-aware re-examination, while releasing the TRACE dataset of 20k annotated images.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
citing papers explorer
-
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.
-
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
Locate-Then-Examine improves AI-generated image detection by localizing suspicious regions first then performing region-aware re-examination, while releasing the TRACE dataset of 20k annotated images.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.