DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
Per- ception tokens enhance visual reasoning in multimodal lan- guage models
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
polarities
background 3representative citing papers
IPT supervision improves spatial reasoning in VLMs on perspective taking, path tracing, and multiview counting tasks, often outperforming textual chain-of-thought while remaining consistent with observed inputs.
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
citing papers explorer
No citing papers match the current filters.