pith. machine review for the scientific record.

arxiv: 2506.09965 · v2 · submitted 2025-06-11 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 04:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatial reasoning · vision-language models · visual drawing · reinforcement learning · multimodal reasoning · maze navigation · bounding boxes · auxiliary lines

The pith

Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current methods for multimodal reasoning in vision-language models stay text-centric even when given images, which restricts their ability to manage precise geometry and track moving positions continuously. The paper presents a new paradigm called drawing to reason in space, where models perform simple drawing actions such as marking bounding boxes and sketching auxiliary lines directly on the visual input. These operations let the model express and examine spatial relations through visual manipulation rather than words alone. Training happens in three stages that begin with synthetic data to learn drawing basics, add reflective rejection sampling, and finish with reinforcement learning to maximize rewards on spatial tasks. A sympathetic reader would care because this internal visual approach could bypass the limits of external perception tools and raise performance on real-world spatial problems.
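
As a concrete reading of that loop, the sketch below shows one way interleaved thinking and drawing could be wired up in Python with Pillow. The `model.next_step` call and the action dictionary it returns are invented for exposition and do not reproduce the paper's actual interface.

```python
# Hypothetical sketch of an interleaved think-and-draw loop; `model.next_step`
# and the action schema are invented for illustration, not the paper's API.
from PIL import Image, ImageDraw

def reason_with_drawing(model, image: Image.Image, question: str, max_steps: int = 8) -> str:
    """Alternate textual reasoning with drawing actions applied to the image."""
    canvas = image.copy()
    context = question
    for _ in range(max_steps):
        step = model.next_step(image=canvas, context=context)  # assumed interface
        if step["type"] == "draw_box":
            # Annotate a bounding box so the referenced object is visually pinned down.
            ImageDraw.Draw(canvas).rectangle(step["xyxy"], outline="red", width=3)
        elif step["type"] == "draw_line":
            # Sketch an auxiliary line, e.g. between two objects being compared.
            ImageDraw.Draw(canvas).line(step["points"], fill="blue", width=3)
        elif step["type"] == "answer":
            return step["text"]
        context += "\n" + step.get("thought", "")
    return "no answer within the step budget"
```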

Core claim

The paper claims that equipping LVLMs with elementary drawing operations in visual space enables them to reason about spatial relationships through direct manipulation, and that a three-stage training framework of cold-start synthetic data, reflective rejection sampling, and reinforcement learning produces a model named VILASR that outperforms prior methods by an average of 18.4 percent on benchmarks covering maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning.

What carries the argument

Drawing to reason in space, a paradigm that lets LVLMs execute basic drawing operations such as annotating bounding boxes and drawing auxiliary lines to express and analyze spatial relationships through direct visual manipulation.
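
To make the primitives tangible, here is a minimal Pillow sketch that marks the start and goal cells of a maze with bounding boxes and traces a candidate path with an auxiliary polyline. The file name and all coordinates are placeholders, not values from the paper.

```python
# Illustrative rendering of the two primitives with Pillow; the file name and
# coordinates are placeholders, not taken from the paper.
from PIL import Image, ImageDraw

img = Image.open("maze.png").convert("RGB")
draw = ImageDraw.Draw(img)

# Bounding boxes marking the start and goal cells.
draw.rectangle([10, 10, 50, 50], outline="green", width=3)     # start
draw.rectangle([210, 170, 250, 210], outline="red", width=3)   # goal

# Auxiliary polyline tracing a candidate path through intermediate waypoints,
# making the spatial relation between start and goal visually explicit.
draw.line([(30, 30), (30, 150), (150, 150), (150, 190), (230, 190)], fill="blue", width=3)

img.save("maze_annotated.png")  # the annotated image is fed back for the next step
```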

If this is right

  • The approach enables precise geometric understanding and continuous spatial tracking directly in visual space.
  • It avoids the performance ceiling that comes from relying on specialized external perception tools.
  • VILASR achieves consistent gains on maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning.
  • Overall accuracy rises by an average of 18.4 percent across the tested spatial reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same drawing-based loop might help models handle dynamic 3D scenes or robotic path planning where positions change over time.
  • Internal visual manipulation could lessen dependence on separate vision modules by building spatial awareness through training alone.
  • Extending the allowed drawing primitives might support more complex geometry such as perspective projections or 3D rotations.
  • Reinforcement on drawing actions could generalize to other tasks that benefit from intermediate visual annotations rather than pure language.

Load-bearing premise

Basic drawing operations like annotating bounding boxes and drawing auxiliary lines can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.

What would settle it

Run VILASR on the same spatial benchmarks but with drawing operations disabled and compare accuracy to the full model; a large drop would support that the drawing mechanism drives the reported gains.
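
A hedged sketch of that experiment, assuming a `model.answer` method with an `allow_drawing` flag; neither is the paper's released interface, and benchmarks are taken to be iterables of (image, question, gold) triples.

```python
# Hedged sketch of the proposed ablation: same trained model, evaluated with and
# without drawing permitted at inference, reporting the per-benchmark drop.
def accuracy(model, examples, allow_drawing: bool) -> float:
    correct = 0
    for image, question, gold in examples:
        pred = model.answer(image, question, allow_drawing=allow_drawing)  # assumed API
        correct += int(pred == gold)
    return correct / max(len(examples), 1)

def drawing_ablation(model, benchmarks: dict) -> dict:
    """Accuracy with and without drawing at inference, plus the drop, per benchmark."""
    report = {}
    for name, data in benchmarks.items():
        examples = list(data)
        full = accuracy(model, examples, allow_drawing=True)
        text_only = accuracy(model, examples, allow_drawing=False)
        report[name] = {"full": full, "text_only": text_only, "drop": full - text_only}
    return report
```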

read the original abstract

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking-capabilities that humans achieve through mental visualization and manipulation. To address the limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, meanwhile avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 'drawing to reason in space' paradigm for large vision-language models (LVLMs) in which the model interleaves textual thinking with elementary visual drawing operations (bounding-box annotation and auxiliary lines) to improve spatial reasoning. It introduces a three-stage training pipeline (synthetic cold-start, reflective rejection sampling, and reinforcement learning) to instill this capability and presents the resulting model VILASR, which is reported to outperform prior methods by an average of 18.4% across maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning benchmarks.

Significance. If the performance gains prove robust and causally attributable to the visual-drawing mechanism rather than the training recipe alone, the work would offer a concrete alternative to purely text-centric or external-tool-dependent multimodal reasoning. It could meaningfully advance the field's ability to equip LVLMs with human-like continuous spatial tracking without relying on specialized perception modules.

major comments (2)
  1. [§4] §4 (Experimental Results): The headline 18.4% average improvement is presented without baseline model specifications, statistical significance tests, ablation results, or error bars. This absence prevents assessment of whether the gains are stable or sensitive to post-hoc choices.
  2. [§3.2] §3.2 and §4.2: The central claim attributes gains to the 'drawing to reason in space' paradigm, yet no ablation holds the three-stage training procedure fixed while removing or disabling the visual drawing primitives (e.g., text-only conditioning at inference). Without this isolation, it remains unclear whether the reported improvements across maze, static, video, and multi-view tasks stem from the drawing operations or from the training interventions themselves.
minor comments (2)
  1. [§2] The abstract and early sections use the term 'interwoven thinking' without a precise operational definition; a short clarifying paragraph or diagram in §2 would improve readability.
  2. Figure captions and benchmark descriptions should explicitly list the exact metrics (e.g., success rate, IoU) and dataset splits used for each task category.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us identify areas to strengthen the presentation of our experimental results and the attribution of our method's gains. We address each major comment in detail below and commit to making the necessary revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The headline 18.4% average improvement is presented without baseline model specifications, statistical significance tests, ablation results, or error bars. This absence prevents assessment of whether the gains are stable or sensitive to post-hoc choices.

    Authors: We agree that the current presentation of results in §4 would benefit from greater rigor and transparency. In the revised manuscript we will explicitly list the base LVLM architectures and training configurations for all compared methods, report error bars computed over multiple random seeds, and include statistical significance tests (e.g., paired t-tests with p-values) for the headline improvements. We will also expand the ablation tables to make the full set of controls more visible; a minimal sketch of such a seed-level comparison follows these responses. revision: yes

  2. Referee: [§3.2] §3.2 and §4.2: The central claim attributes gains to the 'drawing to reason in space' paradigm, yet no ablation holds the three-stage training procedure fixed while removing or disabling the visual drawing primitives (e.g., text-only conditioning at inference). Without this isolation, it remains unclear whether the reported improvements across maze, static, video, and multi-view tasks stem from the drawing operations or from the training interventions themselves.

    Authors: We acknowledge that a controlled ablation keeping the three-stage training fixed while disabling drawing primitives at inference would provide stronger causal evidence. Although our existing experiments compare against text-only baselines and ablate individual training stages, we did not include the precise isolation suggested. We will add this experiment to the revision: models trained with the full pipeline will be evaluated under text-only conditioning (no drawing operations permitted at inference). We expect the results to show a clear drop, thereby supporting attribution to the visual-drawing mechanism. revision: yes
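
Response 1 commits to seed-level error bars and paired significance tests. One plausible form for that comparison, with placeholder numbers rather than results from the paper:

```python
# Illustrative only: per-seed accuracies for two systems on one benchmark,
# a paired t-test, and a standard error for the mean gain. Numbers are placeholders.
import numpy as np
from scipy import stats

baseline = np.array([41.2, 40.7, 42.0, 41.5, 40.9])  # hypothetical per-seed accuracies
vilasr   = np.array([59.8, 60.3, 59.1, 60.6, 59.4])  # same seeds for both systems

t_stat, p_value = stats.ttest_rel(vilasr, baseline)          # paired t-test
mean_gain = float(np.mean(vilasr - baseline))
stderr = float(np.std(vilasr - baseline, ddof=1) / np.sqrt(len(vilasr)))

print(f"mean gain = {mean_gain:.1f} ± {stderr:.1f} (s.e.), t = {t_stat:.2f}, p = {p_value:.4f}")
```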

Circularity Check

0 steps flagged

No significant circularity: empirical training and benchmark evaluation

full rationale

The paper presents an empirical pipeline: a three-stage training procedure (synthetic cold-start, reflective rejection sampling, RL) to instill drawing operations, followed by evaluation on spatial reasoning benchmarks yielding an 18.4% average gain. No equations, fitted parameters, or self-citations are shown to define the target performance metric in terms of the method itself. The reported improvements are measured against external benchmarks and baselines rather than reducing by construction to the training inputs or prior self-references. The derivation chain is therefore self-contained as standard supervised/RL training plus held-out evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The paper rests on the assumption that LVLMs can acquire and usefully apply elementary drawing operations through the described training stages; no new physical entities or mathematical axioms are introduced beyond standard supervised and reinforcement learning assumptions.

free parameters (1)
  • Stage-specific training hyperparameters
    Cold-start, rejection sampling, and RL stages each require learning rates, reward weights, and sampling thresholds that are chosen or fitted during development; an illustrative configuration sketch follows this ledger.
axioms (1)
  • domain assumption LVLMs can acquire effective spatial drawing behavior from synthetic data followed by reflection and reward optimization.
    Invoked to justify the three-stage pipeline as sufficient to instill the claimed capability.
invented entities (1)
  • VILASR no independent evidence
    purpose: The final trained model that performs interwoven thinking and visual drawing.
    The model is the direct output of the training process; no independent falsifiable prediction about its internal parameters is supplied.
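
To show where the ledger's free parameters live, here is a purely hypothetical configuration shape for the three stages; the field names and every value are invented for exposition and do not reproduce the paper's settings.

```python
# Purely illustrative configuration shape for the three training stages named in
# the ledger; every field name and value is invented for exposition.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    learning_rate: float
    epochs: int
    sampling_threshold: Optional[float] = None  # e.g. rejection-sampling keep cutoff
    reward_weight: Optional[float] = None       # e.g. weight on the task-accuracy reward

pipeline = {
    "cold_start_sft": StageConfig(learning_rate=1e-5, epochs=2),
    "reflective_rejection_sampling": StageConfig(learning_rate=5e-6, epochs=1, sampling_threshold=0.5),
    "reinforcement_learning": StageConfig(learning_rate=1e-6, epochs=1, reward_weight=1.0),
}
```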

pith-pipeline@v0.9.0 · 5575 in / 1355 out tokens · 51174 ms · 2026-05-17T04:53:42.203896+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    cs.CV 2026-04 unverdicted novelty 7.0

    SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

  2. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  3. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  4. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

  5. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  6. LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

  7. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  8. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  9. How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.

  10. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  11. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV 2026-05 unverdicted novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

  12. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  13. Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

    cs.CV 2026-04 unverdicted novelty 5.0

    TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution...

  14. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  15. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

  16. OneThinker: All-in-one Reasoning Model for Image and Video

    cs.CV 2025-12 unverdicted novelty 5.0

    OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

  17. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 17 Pith papers · 11 internal anchors

  1. [1]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2024

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Spatial cognition and the brain

    Neil Burgess. Spatial cognition and the brain. Annals of the New York Academy of Sciences, 1124(1):77–97, 2008

  4. [4]

    Spatialbot: Precise spatial understanding with vision language models, 2025

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models, 2025

  5. [5]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  7. [7]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  8. [8]

    From the least to the most: Building a plug-and-play visual reasoner via data synthesis

    Chuanqi Cheng, Jian Guan, Wei Wu, and Rui Yan. From the least to the most: Building a plug-and-play visual reasoner via data synthesis. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4941–4957, Miami, Florida, USA, November 2024. Association for Co...

  9. [9]

    Scaling video-language models to 10k frames via hierarchical differential distillation, 2025

    Chuanqi Cheng, Jian Guan, Wei Wu, and Rui Yan. Scaling video-language models to 10k frames via hierarchical differential distillation, 2025

  10. [10]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  11. [11]

    Open-vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Open-vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352, 2025

  12. [12]

    Learning to prompt for open-vocabulary object detection with vision-language model

    Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14084–14093, 2022

  13. [13]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  14. [14]

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars, 2025

  15. [15]

    Frames of mind: The theory of multiple intelligences

    Howard E Gardner. Frames of mind: The theory of multiple intelligences. Basic books, 2011

  16. [16]

    Introducing gemini 2.0: our new ai model for the agentic era, 2024

    Google. Introducing gemini 2.0: our new ai model for the agentic era, 2024

  17. [17]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, S...

  18. [18]

    Amor: A recipe for building adaptable modular knowledge agents through process feedback

    Jian Guan, Wei Wu, Peng Xu, Hongning Wang, Minlie Huang, et al. Amor: A recipe for building adaptable modular knowledge agents through process feedback. Advances in Neural Information Processing Systems, 37:126118–126148, 2024

  19. [19]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023

  20. [20]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

  21. [21]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  22. [22]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  23. [23]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning, 2024

  24. [24]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  25. [25]

    A configurable library for generating and manipulating maze datasets

    Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, and Samy Wu Fung. A configurable library for generating and manipulating maze datasets, 2023

  26. [26]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  27. [27]

    What’s “up” with vision-language models? Investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785, 2023

  28. [28]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020

  29. [29]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  30. [30]

    Llava-onevision: Easy visual task transfer, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024

  31. [31]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025

  32. [32]

    Topviewrs: Vision-language models as top-view spatial reasoners, 2024

    Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners, 2024

  33. [33]

    Multimodal alignment and fusion: A survey

    Songtao Li and Hao Tang. Multimodal alignment and fusion: A survey. arXiv preprint arXiv:2411.17040, 2024

  34. [34]

    Torl: Scaling tool-integrated rl, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl, 2025

  35. [35]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  36. [36]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024

  37. [37]

    Coarse correspondences boost spatial-temporal reasoning in multimodal language model, 2024

    Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model, 2024

  38. [38]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  39. [39]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  40. [40]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  41. [41]

    Exploring visual–spatial working memory: A critical review of concepts and models

    Julia McAfoose and BT Baune. Exploring visual–spatial working memory: A critical review of concepts and models. Neuropsychology review, 19:130–142, 2009

  42. [42]

    SPARTQA: A textual question answering benchmark for spatial reasoning

    Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. SPARTQA: A textual question answering benchmark for spatial reasoning. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the...

  43. [43]

    Hello gpt-4o

    OpenAI. Hello gpt-4o. In OpenAI Blog, 2024

  44. [44]

    Introducing openai o1-preview

    OpenAI. Introducing openai o1-preview. https://openai.com/index/introducing-openai-o1-preview/, 2024

  45. [45]

    Introducing openai o3 and o4-mini, 2025

    OpenAI. Introducing openai o3 and o4-mini, 2025

  46. [46]

    Thinking with images, 2025

    OpenAI. Thinking with images, 2025

  47. [47]

    Spacer: Reinforcing mllms in video spatial reasoning, 2025

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025

  48. [48]

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raffin, Arc...

  49. [49]

    Skywork r1v: pioneering multimodal reasoning with chain-of-thought

    Yi Peng, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, et al. Skywork r1v: pioneering multimodal reasoning with chain-of-thought. arXiv preprint arXiv:2504.05599, 2025

  50. [50]

    Cogcom: A visual language model with chain-of-manipulations reasoning

    Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, and Jie Tang. Cogcom: A visual language model with chain-of-manipulations reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  51. [51]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025

  52. [52]

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, dahai li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learnin...

  53. [53]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024

  54. [54]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

  55. [55]

    Pangu-coder2: Boosting large language models for code with ranking feedback

    Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, et al. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936, 2023

  56. [56]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. In Advances in Neural Information Processing Systems, 2023

  57. [57]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  58. [58]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  59. [59]

    RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Oral Presentation

  60. [60]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023

  61. [61]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  62. [62]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabi...

  63. [63]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 75392–754...

  64. [64]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024

  65. [65]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  66. [66]

    Codeplan: Unlocking reasoning potential in large language models by scaling code-form planning

    Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, and Minlie Huang. Codeplan: Unlocking reasoning potential in large language models by scaling code-form planning. In The Thirteenth International Conference on Learning Representations, 2025

  67. [67]

    Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

  68. [68]

    Evaluating spatial understanding of large language models

    Yutaro Yamada, Yihan Bao, Andrew K Lampinen, Jungo Kasai, and Ilker Yildirim. Evaluating spatial understanding of large language models. arXiv preprint arXiv:2310.14540, 2023

  69. [69]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. arXiv preprint arXiv:2412.14171, 2024

  70. [70]

    Mmsi-bench: A benchmark for multi-image spatial intelligence, 2025

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence, 2025

  71. [71]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  72. [72]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

  73. [73]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022

  74. [74]

    Metamath: Bootstrap your own mathematical questions for large language models

    Longhui Yu, Weisen Jiang, Han Shi, YU Jincheng, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2023

  75. [75]

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

  76. [76]

    Scaling and beyond: Advancing spatial reasoning in mllms requires new recipes, 2025

    Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, and Furu Wei. Scaling and beyond: Advancing spatial reasoning in mllms requires new recipes, 2025

  77. [77]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d, 2025

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d, 2025

  78. [78]

    Vsr: a unified framework for document layout analysis combining vision, semantics and relations

    Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: a unified framework for document layout analysis combining vision, semantics and relations. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I 16, pages 115–130. Springer, 2021

  79. [79]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024

  80. [80]

    Multimodal chain-of-thought reasoning in language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, hai zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024

Showing first 80 references.