S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Baoliang Tian; Dingwen Zhang; Fangfu Liu; Fangzhou Hong; Hao Li; Kim-Hui Yap; Runmao Yao; Shulin Tian; Tao Wang; Yalun Dai

arxiv: 2606.20515 · v1 · pith:YIUONH5Qnew · submitted 2026-06-18 · 💻 cs.CV

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai , Hao Li , Shulin Tian , Runmao Yao , Yuhao Dong , Fangzhou Hong , Zhaoxi Chen , Fangfu Liu

show 5 more authors

Baoliang Tian Dingwen Zhang Tao Wang Kim-Hui Yap Ziwei Liu

This is my paper

Pith reviewed 2026-06-26 17:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords S-Agentspatial reasoningvision-language modelstool use3D evidence accumulationmulti-view imagesvideo understandingagent memory

0 comments

The pith

S-Agent turns VLMs into spatial planners that accumulate 3D evidence across frames using a hierarchy of grounding and lifting tools plus dual memory stores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

S-Agent reframes spatial reasoning as ongoing evidence collection over continuous multi-view images and videos instead of isolated frame predictions. The VLM acts only as a high-level planner that requests specific evidence, while specialized tools detect objects in 2D, reconstruct their 3D geometry, and combine the results into answers about counts, distances, orientations, and relative positions. Scene Memory tracks the evolving environment and Agent Memory records reasoning steps so evidence can be integrated across time. This combination works without any training and raises accuracy on spatial benchmarks for both open-source and closed-source models. Fine-tuning a small model on the trajectories the system itself generates produces an 8B agent that exceeds same-size baselines and reaches parity with much larger frontier systems.

Core claim

S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge. A temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Experiments on multi-view and video benchmarks demonstrate consistent gains for many VLMs in a training-free setting, and supervised fine-tuning on the generated S-300K trajectories yields S-Agent-8B, which surpasses similar-scale base

What carries the argument

Hierarchy of spatial tools and experts, paired with Scene Memory and Agent Memory, that converts frame-level detections into accumulated 3D scene knowledge under VLM direction.

If this is right

Multi-view and video spatial reasoning benchmarks show gains for both open-source and closed-source VLMs without any training.
Supervised fine-tuning on S-300K trajectories produces S-Agent-8B that exceeds similar-scale models such as Qwen3-VL-8B.
S-Agent-8B reaches performance levels comparable to GPT-5.4 and Gemini 3 on the tested tasks.
The approach shifts spatial perception from frame-centric recognition to scene-centric understanding of evolving 3D environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same planner-plus-tools pattern could be tested in non-spatial domains where evidence must be accumulated over time.
The S-300K trajectories offer a ready source of synthetic data that future work could scale or diversify for spatial training.
Deploying such agents in robotics or navigation systems would require verifying that the 3D lifting step remains accurate under real sensor noise.

Load-bearing premise

The spatial tools can reliably produce accurate 3D geometric evidence from 2D images that the VLM planner can then use for correct high-level answers.

What would settle it

Running the same spatial-reasoning benchmarks with and without S-Agent augmentation and finding equal or lower accuracy for the augmented version on multiple models would falsify the claim of consistent improvement.

Figures

Figures reproduced from arXiv: 2606.20515 by Baoliang Tian, Dingwen Zhang, Fangfu Liu, Fangzhou Hong, Hao Li, Kim-Hui Yap, Runmao Yao, Shulin Tian, Tao Wang, Yalun Dai, Yuhao Dong, Zhaoxi Chen, Ziwei Liu.

**Figure 1.** Figure 1: Overview of S-Agent. S-Agent is the spatial tool-use agentic paradigm designed for continuous multi-view image and video reasoning, which formulates spatial reasoning as an active process of spatio-temporal evidence accumulation. It contains a VLM semantic planner with a hierarchy of spatial tools to ground, lift, and aggregate geometric cues, alongside a dual-memory system to maintain the evolving scene a… view at source ↗

read the original abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce \textbf{\textsc{S-Agent}}, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, \textsc{S-Agent} reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, \textsc{S-Agent} casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (\textit{e.g.}, counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that \textsc{S-Agent} consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on \textsc{S-Agent}-generated spatial trajectories \textsc{S-300K} yields \textsc{S-Agent-8B}, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

S-Agent adds a VLM planner plus a hierarchy of 2D-to-3D spatial tools and dual memory to accumulate evidence across frames, delivering training-free gains on spatial benchmarks and a fine-tuned 8B model trained on the resulting trajectories.

read the letter

The main contribution is the shift from single-frame VLM inference to explicit spatio-temporal evidence accumulation. The VLM acts only as planner while separate tools handle grounding, 3D lifting, and aggregation; two memories keep scene state and reasoning history. This setup is applied both at inference time and to generate the S-300K trajectories used for supervised fine-tuning.

The training-free improvements on open- and closed-source VLMs are the clearest practical result. The fine-tuned S-Agent-8B beating Qwen3-VL-8B and matching some larger closed models on the reported benchmarks is also useful, especially if the trajectories are released.

The weakest link is the 2D-to-3D lifting step. Any systematic error there propagates directly into the counts, measurements, and relations the planner relies on, yet the abstract gives no per-tool accuracy numbers or ablation on the lifting component. The dual-memory design is sensible but its contribution is not isolated in the description either.

The work is aimed at researchers building vision-language agents for robotics or video understanding who already accept tool-use frameworks. The method is concrete enough and the claims are testable, so it should go to referees rather than desk rejection. The experiments need to be checked for effect sizes and controls, but the architecture itself is worth a full review.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces S-Agent, an agentic paradigm for spatial intelligence in VLMs. It positions the VLM as a semantic planner that invokes a hierarchy of spatial tools and experts to ground objects in 2D images, lift them to 3D geometric evidence, and aggregate this evidence for high-level spatial reasoning (counting, measurement, orientation, relative position). A temporal memory system (Scene Memory for evolving scene state and Agent Memory for reasoning context) supports integration across multi-view images and videos. The paper claims that this training-free approach consistently improves both open- and closed-source VLMs on multi-view and video spatial reasoning benchmarks; additionally, supervised fine-tuning on the generated S-300K spatial trajectories produces S-Agent-8B, which surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4, Gemini 3).

Significance. If the empirical claims hold under rigorous controls, the work would advance spatial reasoning in VLMs by shifting from frame-centric prediction to scene-centric, evidence-accumulating reasoning over continuous 3D scenes. The training-free tool-augmented paradigm and the S-300K trajectory dataset would be useful contributions for both inference-time augmentation and supervised training of compact spatial agents.

major comments (2)

[Abstract] Abstract: the central claims of 'consistent improvements' for both open- and closed-source VLMs and that S-Agent-8B 'significantly surpasses' similar-scale baselines while matching advanced closed-source models are presented without any quantitative results, specific benchmark names, metrics, error bars, ablation studies, or details on tool accuracy. This absence prevents assessment of whether the data support the performance assertions.
[Abstract] Abstract: the hierarchy of spatial tools is described only at a high level (2D grounding, 3D lifting, evidence aggregation). Without concrete specifications of the tools, their accuracy, failure modes, or how the VLM planner interfaces with them, the load-bearing assumption that reliable 2D-to-3D lifting and aggregation can be achieved remains unverified and central to all reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract should better substantiate its central claims with concrete details and will revise it in the next version of the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 'consistent improvements' for both open- and closed-source VLMs and that S-Agent-8B 'significantly surpasses' similar-scale baselines while matching advanced closed-source models are presented without any quantitative results, specific benchmark names, metrics, error bars, ablation studies, or details on tool accuracy. This absence prevents assessment of whether the data support the performance assertions.

Authors: We agree that the abstract would be strengthened by including quantitative support. In the revision we will add specific benchmark names (multi-view and video spatial reasoning benchmarks), key metrics with example deltas, and the main performance figures for both the training-free S-Agent improvements and the S-Agent-8B model versus Qwen3-VL-8B and closed-source models. This will allow readers to evaluate the claims directly from the abstract. revision: yes
Referee: [Abstract] Abstract: the hierarchy of spatial tools is described only at a high level (2D grounding, 3D lifting, evidence aggregation). Without concrete specifications of the tools, their accuracy, failure modes, or how the VLM planner interfaces with them, the load-bearing assumption that reliable 2D-to-3D lifting and aggregation can be achieved remains unverified and central to all reported gains.

Authors: We acknowledge the abstract currently describes the tool hierarchy at a high level. The manuscript provides concrete tool specifications, accuracy measurements, failure-mode analysis, and planner-tool interface details in Sections 3.2–3.3 together with supporting ablations. We will revise the abstract to name the principal tool components and note that their reliability is quantified in the experimental sections, thereby making the central assumption more transparent in the summary. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an architectural agentic paradigm (VLM planner + spatial tool hierarchy + memory) and reports empirical gains from inference-time tool use and SFT on generated trajectories. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or high-level description. Claims rest on experimental benchmarks rather than any reduction to inputs by construction, satisfying the default expectation of non-circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; full text would be required to populate this ledger.

pith-pipeline@v0.9.1-grok · 5864 in / 1324 out tokens · 27649 ms · 2026-06-26T17:53:56.501990+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 21 linked inside Pith

[1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others. 2022. Flamingo: a visual language model for few-shot le...

Pith/arXiv arXiv 2022
[2]

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, and 35 others. 2023. Rt-2: Vision-language-action models ...

Pith/arXiv arXiv 2023
[3]

train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. 2025. Benchmark designers should "train on the test set" to expose exploitable non-visual shortcuts.Preprint, arXiv:2511.04655

arXiv 2025
[4]

Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631

2020
[5]

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, and 10 others. 2026. Scaling spatial intelligence with multimodal foundation models. Preprint, arXiv:2511.13719

arXiv 2026
[6]

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, and Jonathan Tremblay. 2025. Spacetools: Tool-augmented spatial reasoning via double interactive rl. Preprint, arXiv:2512.04069

Pith/arXiv arXiv 2025
[7]

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng
[8]

Geometrically-constrained agent for spatial reasoning.Preprint, arXiv:2511.22659

arXiv
[9]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, and 3 others. 2023. Palm-e: An embodied multimodal language mod...

Pith/arXiv arXiv 2023
[10]

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. 2024. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.Preprint, arXiv:2406.05756

arXiv 2024
[11]

Smith, Wei-Chiu Ma, and Ranjay Krishna

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive.Preprint, arXiv:2404.12390

Pith/arXiv arXiv 2024
[12]

Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? the kitti vision benchmark suite. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361

2012
[13]

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. 2025. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.Preprint, arXiv:2505.21500

arXiv 2025
[14]

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. 2025. Spatialladder: Progressive training for spatial reasoning in vision-language models.Preprint, arXiv:2510.08531

arXiv 2025
[15]

Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. 2025. Depth anything 3: Recovering the visual space from any views.Preprint, arXiv:2511.10647

Pith/arXiv arXiv 2025
[16]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning.Preprint, arXiv:2304.08485

Pith/arXiv arXiv 2023
[17]

de Melo, and Alan Yuille

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M. de Melo, and Alan Yuille
[18]

3dsrbench: A comprehensive 3d spatial reasoning benchmark.Preprint, arXiv:2412.07825

arXiv
[19]

Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. 2025. Visual agentic ai for spatial reasoning with a dynamic api.Preprint, arXiv:2502.06787

arXiv 2025
[20]

Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J

Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. Kinectfusion: Real-time dense surface mapping and tracking. In2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE. 14

2011
[21]

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. 2025. Spacer: Reinforcing mllms in video spatial reasoning.Preprint, arXiv:2504.01805

Pith/arXiv arXiv 2025
[22]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.Preprint, arXiv:2103.00020

Pith/arXiv arXiv 2021
[23]

Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning.Preprint, arXiv:2303.08128

Pith/arXiv arXiv 2023
[24]

Vishaal Udandarao, Shyamgopal Karthik, Surabhi S Nath, Andreas Hochlehnert, Matthias Bethge, and Ameya Prabhu. 2025. Solving spatial supersensing without spatial supersensing.arXiv preprint arXiv:2511.16655

arXiv 2025
[25]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer.Preprint, arXiv:2503.11651

arXiv 2025
[26]

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. 2026. Mindcube: Spatial mental modeling from limited views.Preprint, arXiv:2506.21458

arXiv 2026
[27]

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual chatgpt: Talking, drawing and editing with visual foundation models.Preprint, arXiv:2303.04671

Pith/arXiv arXiv 2023
[28]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Preprint, arXiv:2505.23747

Pith/arXiv arXiv 2025
[29]

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. 2025. Rein- forcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.Preprint, arXiv:2506.09965

Pith/arXiv arXiv 2025
[30]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. 2025. Thinking in space: How multimodal large language models see, remember, and recall spaces.Preprint, arXiv:2412.14171

Pith/arXiv arXiv 2025
[31]

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. 2025. Visual spatial tuning.Preprint, arXiv:2511.05491

arXiv 2025
[32]

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. 2025. Cambrian-s: Towards spatial supersensing in video.Preprint, arXiv:2511.04670

Pith/arXiv arXiv 2025
[33]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. 2025. Mmsi-bench: A benchmark for multi-image spatial intelligence.Preprint, arXiv:2505.23764

Pith/arXiv arXiv 2025
[34]

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. Mm-react: Prompting chatgpt for multimodal reasoning and action. Preprint, arXiv:2303.11381

Pith/arXiv arXiv 2023
[35]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models.Preprint, arXiv:2210.03629

Pith/arXiv arXiv 2023
[36]

Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, and Manling Li. 2025. T*: Re-thinking temporal search for long-form video understanding.Preprint, arXiv:2504.02259

arXiv 2025
[37]

Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding.Preprint, arXiv:2306.02858

Pith/arXiv arXiv 2023
[38]

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. 2024. Long context transfer from language to vision.Preprint, arXiv:2406.16852

Pith/arXiv arXiv 2024
[39]

Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, and Angel X. Chang. 2026. Revsi: Rebuild- ing visual spatial intelligence evaluation for accurate assessment of vlm 3d reasoning.Preprint, arXiv:2604.24300

Pith/arXiv arXiv 2026
[40]

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, and 1 others. 2026. Think3d: Thinking with space for spatial reasoning.arXiv preprint arXiv:2601.13029

arXiv 2026
[41]

B. Northwest

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguisti...

2024

[1] [1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others. 2022. Flamingo: a visual language model for few-shot le...

Pith/arXiv arXiv 2022

[2] [2]

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, and 35 others. 2023. Rt-2: Vision-language-action models ...

Pith/arXiv arXiv 2023

[3] [3]

train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. 2025. Benchmark designers should "train on the test set" to expose exploitable non-visual shortcuts.Preprint, arXiv:2511.04655

arXiv 2025

[4] [4]

Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631

2020

[5] [5]

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, and 10 others. 2026. Scaling spatial intelligence with multimodal foundation models. Preprint, arXiv:2511.13719

arXiv 2026

[6] [6]

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, and Jonathan Tremblay. 2025. Spacetools: Tool-augmented spatial reasoning via double interactive rl. Preprint, arXiv:2512.04069

Pith/arXiv arXiv 2025

[7] [7]

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng

[8] [8]

Geometrically-constrained agent for spatial reasoning.Preprint, arXiv:2511.22659

arXiv

[9] [9]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, and 3 others. 2023. Palm-e: An embodied multimodal language mod...

Pith/arXiv arXiv 2023

[10] [10]

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. 2024. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.Preprint, arXiv:2406.05756

arXiv 2024

[11] [11]

Smith, Wei-Chiu Ma, and Ranjay Krishna

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive.Preprint, arXiv:2404.12390

Pith/arXiv arXiv 2024

[12] [12]

Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? the kitti vision benchmark suite. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361

2012

[13] [13]

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. 2025. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.Preprint, arXiv:2505.21500

arXiv 2025

[14] [14]

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. 2025. Spatialladder: Progressive training for spatial reasoning in vision-language models.Preprint, arXiv:2510.08531

arXiv 2025

[15] [15]

Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. 2025. Depth anything 3: Recovering the visual space from any views.Preprint, arXiv:2511.10647

Pith/arXiv arXiv 2025

[16] [16]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning.Preprint, arXiv:2304.08485

Pith/arXiv arXiv 2023

[17] [17]

de Melo, and Alan Yuille

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M. de Melo, and Alan Yuille

[18] [18]

3dsrbench: A comprehensive 3d spatial reasoning benchmark.Preprint, arXiv:2412.07825

arXiv

[19] [19]

Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. 2025. Visual agentic ai for spatial reasoning with a dynamic api.Preprint, arXiv:2502.06787

arXiv 2025

[20] [20]

Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J

Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. Kinectfusion: Real-time dense surface mapping and tracking. In2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE. 14

2011

[21] [21]

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. 2025. Spacer: Reinforcing mllms in video spatial reasoning.Preprint, arXiv:2504.01805

Pith/arXiv arXiv 2025

[22] [22]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.Preprint, arXiv:2103.00020

Pith/arXiv arXiv 2021

[23] [23]

Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning.Preprint, arXiv:2303.08128

Pith/arXiv arXiv 2023

[24] [24]

Vishaal Udandarao, Shyamgopal Karthik, Surabhi S Nath, Andreas Hochlehnert, Matthias Bethge, and Ameya Prabhu. 2025. Solving spatial supersensing without spatial supersensing.arXiv preprint arXiv:2511.16655

arXiv 2025

[25] [25]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer.Preprint, arXiv:2503.11651

arXiv 2025

[26] [26]

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. 2026. Mindcube: Spatial mental modeling from limited views.Preprint, arXiv:2506.21458

arXiv 2026

[27] [27]

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual chatgpt: Talking, drawing and editing with visual foundation models.Preprint, arXiv:2303.04671

Pith/arXiv arXiv 2023

[28] [28]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Preprint, arXiv:2505.23747

Pith/arXiv arXiv 2025

[29] [29]

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. 2025. Rein- forcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.Preprint, arXiv:2506.09965

Pith/arXiv arXiv 2025

[30] [30]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. 2025. Thinking in space: How multimodal large language models see, remember, and recall spaces.Preprint, arXiv:2412.14171

Pith/arXiv arXiv 2025

[31] [31]

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. 2025. Visual spatial tuning.Preprint, arXiv:2511.05491

arXiv 2025

[32] [32]

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. 2025. Cambrian-s: Towards spatial supersensing in video.Preprint, arXiv:2511.04670

Pith/arXiv arXiv 2025

[33] [33]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. 2025. Mmsi-bench: A benchmark for multi-image spatial intelligence.Preprint, arXiv:2505.23764

Pith/arXiv arXiv 2025

[34] [34]

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. Mm-react: Prompting chatgpt for multimodal reasoning and action. Preprint, arXiv:2303.11381

Pith/arXiv arXiv 2023

[35] [35]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models.Preprint, arXiv:2210.03629

Pith/arXiv arXiv 2023

[36] [36]

Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, and Manling Li. 2025. T*: Re-thinking temporal search for long-form video understanding.Preprint, arXiv:2504.02259

arXiv 2025

[37] [37]

Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding.Preprint, arXiv:2306.02858

Pith/arXiv arXiv 2023

[38] [38]

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. 2024. Long context transfer from language to vision.Preprint, arXiv:2406.16852

Pith/arXiv arXiv 2024

[39] [39]

Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, and Angel X. Chang. 2026. Revsi: Rebuild- ing visual spatial intelligence evaluation for accurate assessment of vlm 3d reasoning.Preprint, arXiv:2604.24300

Pith/arXiv arXiv 2026

[40] [40]

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, and 1 others. 2026. Think3d: Thinking with space for spatial reasoning.arXiv preprint arXiv:2601.13029

arXiv 2026

[41] [41]

B. Northwest

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguisti...

2024