pith. machine review for the scientific record.

arxiv: 2506.09965 · v2 · submitted 2025-06-11 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 04:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatial reasoning · vision-language models · visual drawing · reinforcement learning · multimodal reasoning · maze navigation · bounding boxes · auxiliary lines

The pith

Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current methods for multimodal reasoning in vision-language models stay text-centric even when given images, which restricts their ability to manage precise geometry and track moving positions continuously. The paper presents a new paradigm called drawing to reason in space, where models perform simple drawing actions such as marking bounding boxes and sketching auxiliary lines directly on the visual input. These operations let the model express and examine spatial relations through visual manipulation rather than words alone. Training happens in three stages that begin with synthetic data to learn drawing basics, add reflective rejection sampling, and finish with reinforcement learning to maximize rewards on spatial tasks. A sympathetic reader would care because this internal visual approach could bypass the limits of external perception tools and raise performance on real-world spatial problems.
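
As a concrete reading of that loop, the sketch below shows one way interleaved thinking and drawing could be wired up in Python with Pillow. The `model.next_step` call and the action dictionary it returns are invented for exposition and do not reproduce the paper's actual interface.

```python
# Hypothetical sketch of an interleaved think-and-draw loop; `model.next_step`
# and the action schema are invented for illustration, not the paper's API.
from PIL import Image, ImageDraw

def reason_with_drawing(model, image: Image.Image, question: str, max_steps: int = 8) -> str:
    """Alternate textual reasoning with drawing actions applied to the image."""
    canvas = image.copy()
    context = question
    for _ in range(max_steps):
        step = model.next_step(image=canvas, context=context)  # assumed interface
        if step["type"] == "draw_box":
            # Annotate a bounding box so the referenced object is visually pinned down.
            ImageDraw.Draw(canvas).rectangle(step["xyxy"], outline="red", width=3)
        elif step["type"] == "draw_line":
            # Sketch an auxiliary line, e.g. between two objects being compared.
            ImageDraw.Draw(canvas).line(step["points"], fill="blue", width=3)
        elif step["type"] == "answer":
            return step["text"]
        context += "\n" + step.get("thought", "")
    return "no answer within the step budget"
```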

Core claim

The paper claims that equipping LVLMs with elementary drawing operations in visual space enables them to reason about spatial relationships through direct manipulation, and that a three-stage training framework of cold-start synthetic data, reflective rejection sampling, and reinforcement learning produces a model named VILASR that outperforms prior methods by an average of 18.4 percent on benchmarks covering maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning.

What carries the argument

Drawing to reason in space, a paradigm that lets LVLMs execute basic drawing operations such as annotating bounding boxes and drawing auxiliary lines to express and analyze spatial relationships through direct visual manipulation.
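
To make the primitives tangible, here is a minimal Pillow sketch that marks the start and goal cells of a maze with bounding boxes and traces a candidate path with an auxiliary polyline. The file name and all coordinates are placeholders, not values from the paper.

```python
# Illustrative rendering of the two primitives with Pillow; the file name and
# coordinates are placeholders, not taken from the paper.
from PIL import Image, ImageDraw

img = Image.open("maze.png").convert("RGB")
draw = ImageDraw.Draw(img)

# Bounding boxes marking the start and goal cells.
draw.rectangle([10, 10, 50, 50], outline="green", width=3)     # start
draw.rectangle([210, 170, 250, 210], outline="red", width=3)   # goal

# Auxiliary polyline tracing a candidate path through intermediate waypoints,
# making the spatial relation between start and goal visually explicit.
draw.line([(30, 30), (30, 150), (150, 150), (150, 190), (230, 190)], fill="blue", width=3)

img.save("maze_annotated.png")  # the annotated image is fed back for the next step
```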

If this is right

  • The approach enables precise geometric understanding and continuous spatial tracking directly in visual space.
  • It avoids the performance ceiling that comes from relying on specialized external perception tools.
  • VILASR achieves consistent gains on maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning.
  • Overall accuracy rises by an average of 18.4 percent across the tested spatial reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same drawing-based loop might help models handle dynamic 3D scenes or robotic path planning where positions change over time.
  • Internal visual manipulation could lessen dependence on separate vision modules by building spatial awareness through training alone.
  • Extending the allowed drawing primitives might support more complex geometry such as perspective projections or 3D rotations.
  • Reinforcement on drawing actions could generalize to other tasks that benefit from intermediate visual annotations rather than pure language.

Load-bearing premise

Basic drawing operations like annotating bounding boxes and drawing auxiliary lines can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.

What would settle it

Run VILASR on the same spatial benchmarks but with drawing operations disabled and compare accuracy to the full model; a large drop would support that the drawing mechanism drives the reported gains.
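
A hedged sketch of that experiment, assuming a `model.answer` method with an `allow_drawing` flag; neither is the paper's released interface, and benchmarks are taken to be iterables of (image, question, gold) triples.

```python
# Hedged sketch of the proposed ablation: same trained model, evaluated with and
# without drawing permitted at inference, reporting the per-benchmark drop.
def accuracy(model, examples, allow_drawing: bool) -> float:
    correct = 0
    for image, question, gold in examples:
        pred = model.answer(image, question, allow_drawing=allow_drawing)  # assumed API
        correct += int(pred == gold)
    return correct / max(len(examples), 1)

def drawing_ablation(model, benchmarks: dict) -> dict:
    """Accuracy with and without drawing at inference, plus the drop, per benchmark."""
    report = {}
    for name, data in benchmarks.items():
        examples = list(data)
        full = accuracy(model, examples, allow_drawing=True)
        text_only = accuracy(model, examples, allow_drawing=False)
        report[name] = {"full": full, "text_only": text_only, "drop": full - text_only}
    return report
```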

read the original abstract

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking-capabilities that humans achieve through mental visualization and manipulation. To address the limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, meanwhile avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 'drawing to reason in space' paradigm for large vision-language models (LVLMs) in which the model interleaves textual thinking with elementary visual drawing operations (bounding-box annotation and auxiliary lines) to improve spatial reasoning. It introduces a three-stage training pipeline (synthetic cold-start, reflective rejection sampling, and reinforcement learning) to instill this capability and presents the resulting model VILASR, which is reported to outperform prior methods by an average of 18.4% across maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning benchmarks.

Significance. If the performance gains prove robust and causally attributable to the visual-drawing mechanism rather than the training recipe alone, the work would offer a concrete alternative to purely text-centric or external-tool-dependent multimodal reasoning. It could meaningfully advance the field's ability to equip LVLMs with human-like continuous spatial tracking without relying on specialized perception modules.

major comments (2)
  1. [§4] §4 (Experimental Results): The headline 18.4% average improvement is presented without baseline model specifications, statistical significance tests, ablation results, or error bars. This absence prevents assessment of whether the gains are stable or sensitive to post-hoc choices.
  2. [§3.2] §3.2 and §4.2: The central claim attributes gains to the 'drawing to reason in space' paradigm, yet no ablation holds the three-stage training procedure fixed while removing or disabling the visual drawing primitives (e.g., text-only conditioning at inference). Without this isolation, it remains unclear whether the reported improvements across maze, static, video, and multi-view tasks stem from the drawing operations or from the training interventions themselves.
minor comments (2)
  1. [§2] The abstract and early sections use the term 'interwoven thinking' without a precise operational definition; a short clarifying paragraph or diagram in §2 would improve readability.
  2. Figure captions and benchmark descriptions should explicitly list the exact metrics (e.g., success rate, IoU) and dataset splits used for each task category.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us identify areas to strengthen the presentation of our experimental results and the attribution of our method's gains. We address each major comment in detail below and commit to making the necessary revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The headline 18.4% average improvement is presented without baseline model specifications, statistical significance tests, ablation results, or error bars. This absence prevents assessment of whether the gains are stable or sensitive to post-hoc choices.

    Authors: We agree that the current presentation of results in §4 would benefit from greater rigor and transparency. In the revised manuscript we will explicitly list the base LVLM architectures and training configurations for all compared methods, report error bars computed over multiple random seeds, and include statistical significance tests (e.g., paired t-tests with p-values) for the headline improvements. We will also expand the ablation tables to make the full set of controls more visible; a minimal sketch of such a seed-level comparison follows these responses. revision: yes

  2. Referee: [§3.2] §3.2 and §4.2: The central claim attributes gains to the 'drawing to reason in space' paradigm, yet no ablation holds the three-stage training procedure fixed while removing or disabling the visual drawing primitives (e.g., text-only conditioning at inference). Without this isolation, it remains unclear whether the reported improvements across maze, static, video, and multi-view tasks stem from the drawing operations or from the training interventions themselves.

    Authors: We acknowledge that a controlled ablation keeping the three-stage training fixed while disabling drawing primitives at inference would provide stronger causal evidence. Although our existing experiments compare against text-only baselines and ablate individual training stages, we did not include the precise isolation suggested. We will add this experiment to the revision: models trained with the full pipeline will be evaluated under text-only conditioning (no drawing operations permitted at inference). We expect the results to show a clear drop, thereby supporting attribution to the visual-drawing mechanism. revision: yes
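
Response 1 commits to seed-level error bars and paired significance tests. One plausible form for that comparison, with placeholder numbers rather than results from the paper:

```python
# Illustrative only: per-seed accuracies for two systems on one benchmark,
# a paired t-test, and a standard error for the mean gain. Numbers are placeholders.
import numpy as np
from scipy import stats

baseline = np.array([41.2, 40.7, 42.0, 41.5, 40.9])  # hypothetical per-seed accuracies
vilasr   = np.array([59.8, 60.3, 59.1, 60.6, 59.4])  # same seeds for both systems

t_stat, p_value = stats.ttest_rel(vilasr, baseline)          # paired t-test
mean_gain = float(np.mean(vilasr - baseline))
stderr = float(np.std(vilasr - baseline, ddof=1) / np.sqrt(len(vilasr)))

print(f"mean gain = {mean_gain:.1f} ± {stderr:.1f} (s.e.), t = {t_stat:.2f}, p = {p_value:.4f}")
```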

Circularity Check

0 steps flagged

No significant circularity: empirical training and benchmark evaluation

full rationale

The paper presents an empirical pipeline: a three-stage training procedure (synthetic cold-start, reflective rejection sampling, RL) to instill drawing operations, followed by evaluation on spatial reasoning benchmarks yielding an 18.4% average gain. No equations, fitted parameters, or self-citations are shown to define the target performance metric in terms of the method itself. The reported improvements are measured against external benchmarks and baselines rather than reducing by construction to the training inputs or prior self-references. The derivation chain is therefore self-contained as standard supervised/RL training plus held-out evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The paper rests on the assumption that LVLMs can acquire and usefully apply elementary drawing operations through the described training stages; no new physical entities or mathematical axioms are introduced beyond standard supervised and reinforcement learning assumptions.

free parameters (1)
  • Stage-specific training hyperparameters
    Cold-start, rejection sampling, and RL stages each require learning rates, reward weights, and sampling thresholds that are chosen or fitted during development; an illustrative configuration sketch follows this ledger.
axioms (1)
  • domain assumption LVLMs can acquire effective spatial drawing behavior from synthetic data followed by reflection and reward optimization.
    Invoked to justify the three-stage pipeline as sufficient to instill the claimed capability.
invented entities (1)
  • VILASR no independent evidence
    purpose: The final trained model that performs interwoven thinking and visual drawing.
    The model is the direct output of the training process; no independent falsifiable prediction about its internal parameters is supplied.
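
To show where the ledger's free parameters live, here is a purely hypothetical configuration shape for the three stages; the field names and every value are invented for exposition and do not reproduce the paper's settings.

```python
# Purely illustrative configuration shape for the three training stages named in
# the ledger; every field name and value is invented for exposition.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    learning_rate: float
    epochs: int
    sampling_threshold: Optional[float] = None  # e.g. rejection-sampling keep cutoff
    reward_weight: Optional[float] = None       # e.g. weight on the task-accuracy reward

pipeline = {
    "cold_start_sft": StageConfig(learning_rate=1e-5, epochs=2),
    "reflective_rejection_sampling": StageConfig(learning_rate=5e-6, epochs=1, sampling_threshold=0.5),
    "reinforcement_learning": StageConfig(learning_rate=1e-6, epochs=1, reward_weight=1.0),
}
```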

pith-pipeline@v0.9.0 · 5575 in / 1355 out tokens · 51174 ms · 2026-05-17T04:53:42.203896+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    cs.CV 2026-04 unverdicted novelty 7.0

    SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

  2. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  3. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  4. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

  5. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  6. LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

  7. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  8. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  9. How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.

  10. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  11. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV 2026-05 unverdicted novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

  12. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  13. Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

    cs.CV 2026-04 unverdicted novelty 5.0

    TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution...

  14. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  15. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

  16. OneThinker: All-in-one Reasoning Model for Image and Video

    cs.CV 2025-12 unverdicted novelty 5.0

    OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

  17. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 17 Pith papers · 11 internal anchors

  1. [1]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2024

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Spatial cognition and the brain

    Neil Burgess. Spatial cognition and the brain. Annals of the New York Academy of Sciences, 1124(1):77–97, 2008

  4. [4]

    Spatialbot: Precise spatial understanding with vision language models, 2025

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models, 2025

  5. [5]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  7. [7]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  8. [8]

    From the least to the most: Building a plug-and-play visual reasoner via data synthesis

    Chuanqi Cheng, Jian Guan, Wei Wu, and Rui Yan. From the least to the most: Building a plug-and-play visual reasoner via data synthesis. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4941–4957, Miami, Florida, USA, November 2024. Association for Co...

  9. [9]

    Scaling video-language models to 10k frames via hierarchical differential distillation, 2025

    Chuanqi Cheng, Jian Guan, Wei Wu, and Rui Yan. Scaling video-language models to 10k frames via hierarchical differential distillation, 2025

  10. [10]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  11. [11]

    Open-vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Open-vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352, 2025

  12. [12]

    Learning to prompt for open-vocabulary object detection with vision-language model

    Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14084–14093, 2022

  13. [13]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  14. [14]

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars, 2025

  15. [15]

    Frames of mind: The theory of multiple intelligences

    Howard E Gardner. Frames of mind: The theory of multiple intelligences. Basic books, 2011

  16. [16]

    Introducing gemini 2.0: our new ai model for the agentic era, 2024

    Google. Introducing gemini 2.0: our new ai model for the agentic era, 2024

  17. [17]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, S...

  18. [18]

    Amor: A recipe for building adaptable modular knowledge agents through process feedback

    Jian Guan, Wei Wu, Peng Xu, Hongning Wang, Minlie Huang, et al. Amor: A recipe for building adaptable modular knowledge agents through process feedback. Advances in Neural Information Processing Systems, 37:126118–126148, 2024

  19. [19]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023

  20. [20]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

  21. [21]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  22. [22]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  23. [23]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning, 2024

  24. [24]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  25. [25]

    A configurable library for generating and manipulating maze datasets

    Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, and Samy Wu Fung. A configurable library for generating and manipulating maze datasets, 2023

  26. [26]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  27. [27]

    What’s “up” with vision-language models? Investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785, 2023

  28. [28]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020

  29. [29]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  30. [30]

    Llava-onevision: Easy visual task transfer, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024

  31. [31]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025

  32. [32]

    Topviewrs: Vision-language models as top-view spatial reasoners, 2024

    Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners, 2024

  33. [33]

    Multimodal alignment and fusion: A survey

    Songtao Li and Hao Tang. Multimodal alignment and fusion: A survey. arXiv preprint arXiv:2411.17040, 2024

  34. [34]

    Torl: Scaling tool-integrated rl, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl, 2025

  35. [35]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  36. [36]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024

  37. [37]

    Coarse correspondences boost spatial-temporal reasoning in multimodal language model, 2024

    Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model, 2024

  38. [38]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  39. [39]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  40. [40]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  41. [41]

    Exploring visual–spatial working memory: A critical review of concepts and models

    Julia McAfoose and BT Baune. Exploring visual–spatial working memory: A critical review of concepts and models. Neuropsychology review, 19:130–142, 2009

  42. [42]

    SPARTQA: A textual question answering benchmark for spatial reasoning

    Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. SPARTQA: A textual question answering benchmark for spatial reasoning. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the...

  43. [43]

    Hello gpt-4o

    OpenAI. Hello gpt-4o. In OpenAI Blog, 2024

  44. [44]

    Introducing openai o1-preview

    OpenAI. Introducing openai o1-preview. https://openai.com/index/introducing-openai-o1-preview/, 2024

  45. [45]

    Introducing openai o3 and o4-mini, 2025

    OpenAI. Introducing openai o3 and o4-mini, 2025

  46. [46]

    Thinking with images, 2025

    OpenAI. Thinking with images, 2025

  47. [47]

    Spacer: Reinforcing mllms in video spatial reasoning, 2025

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025

  48. [48]

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raffin, Arc...

  49. [49]

    Skywork r1v: pioneering multimodal reasoning with chain-of-thought

    Yi Peng, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, et al. Skywork r1v: pioneering multimodal reasoning with chain-of-thought. arXiv preprint arXiv:2504.05599, 2025

  50. [50]

    Cogcom: A visual language model with chain-of-manipulations reasoning

    Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, and Jie Tang. Cogcom: A visual language model with chain-of-manipulations reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  51. [51]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025

  52. [52]

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, dahai li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learnin...

  53. [53]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024

  54. [54]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

  55. [55]

    Pangu-coder2: Boosting large language models for code with ranking feedback

    Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, et al. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936, 2023

  56. [56]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. In Advances in Neural Information Processing Systems, 2023

  57. [57]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  58. [58]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  59. [59]

    RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Oral Presentation

  60. [60]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023

  61. [61]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  62. [62]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabi...

  63. [63]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 75392–754...

  64. [64]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024

  65. [65]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  66. [66]

    Codeplan: Unlocking reasoning potential in large language models by scaling code-form planning

    Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, and Minlie Huang. Codeplan: Unlocking reasoning potential in large language models by scaling code-form planning. In The Thirteenth International Conference on Learning Representations, 2025

  67. [67]

    Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

  68. [68]

    Evaluating spatial understanding of large language models

    Yutaro Yamada, Yihan Bao, Andrew K Lampinen, Jungo Kasai, and Ilker Yildirim. Evaluating spatial understanding of large language models. arXiv preprint arXiv:2310.14540, 2023

  69. [69]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. arXiv preprint arXiv:2412.14171, 2024

  70. [70]

    Mmsi-bench: A benchmark for multi-image spatial intelligence, 2025

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence, 2025

  71. [71]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  72. [72]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

  73. [73]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022

  74. [74]

    Metamath: Bootstrap your own mathematical questions for large language models

    Longhui Yu, Weisen Jiang, Han Shi, YU Jincheng, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2023

  75. [75]

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

  76. [76]

    Scaling and beyond: Advancing spatial reasoning in mllms requires new recipes, 2025

    Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, and Furu Wei. Scaling and beyond: Advancing spatial reasoning in mllms requires new recipes, 2025

  77. [77]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d, 2025

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d, 2025

  78. [78]

    Vsr: a unified framework for document layout analysis combining vision, semantics and relations

    Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: a unified framework for document layout analysis combining vision, semantics and relations. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I 16, pages 115–130. Springer, 2021

  79. [79]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024

  80. [80]

    Multimodal chain-of-thought reasoning in language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, hai zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024

Showing first 80 references.