S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

arXiv: 2606.20515 · v2 · pith:YIUONH5Qnew · submitted 2026-06-18 · 💻 cs.CV

Pith reviewed 2026-06-30 10:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords spatial reasoning · tool-use agents · vision-language models · multi-view images · video understanding · 3D evidence · scene memory · spatial intelligence

The pith

Spatial tool-use lets vision-language models accumulate 3D evidence across frames for continuous reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that spatial reasoning improves when VLMs act as planners that direct a hierarchy of tools to extract and combine geometric evidence from multiple views and video frames rather than predicting from isolated images. The approach includes mechanisms to detect objects in 2D, lift them to 3D, aggregate results into spatial relations, and maintain memory of the evolving scene and reasoning steps. This produces training-free gains on existing models and also supplies data for fine-tuning a compact model that reaches performance levels of much larger systems. A sympathetic reader would care because real tasks such as navigation or object manipulation require understanding an evolving 3D environment from continuous visual input.

Core claim

S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge. A temporal memory mechanism, comprising Scene Memory for the evolving scene state and Agent Memory for reasoning context, enables evidence integration across frames. Comprehensive experiments show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner, and supervised fine-tuning on S-Agent-generated trajectories yields S-Agent-8B, which surpasses similar-scale baselines and performs comparably to advanced closed-source models.

What carries the argument

Hierarchy of spatial tools and experts that grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates evidence into high-level spatial knowledge, supported by Scene Memory and Agent Memory for temporal integration.
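The ground, lift, aggregate chain can be made concrete with a minimal sketch. This is not the paper's implementation: the abstract does not say which detectors, depth estimators, or camera models S-Agent uses. The sketch assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a metric depth value at each box center, and camera coordinates with x pointing right. All function names, boxes, and numbers are hypothetical.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def box_center(box: Box2D) -> Tuple[float, float]:
    """Ground: reduce a 2D detection box to its center pixel."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def lift_to_3d(u: float, v: float, depth: float,
               fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift: back-project pixel (u, v) with metric depth via the pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def distance(a: Point3D, b: Point3D) -> float:
    """Aggregate: Euclidean distance between two lifted objects."""
    return math.dist(a, b)


def left_or_right(a: Point3D, b: Point3D) -> str:
    """Aggregate: relation of a with respect to b along the camera x axis."""
    return "left of" if a[0] < b[0] else "right of"


# Hypothetical intrinsics for a 1280x720 image, plus hypothetical detections.
intr = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
chair = lift_to_3d(*box_center((620, 340, 660, 380)), depth=2.0, **intr)
table = lift_to_3d(*box_center((1120, 340, 1160, 380)), depth=2.0, **intr)

print(chair, table)
print(distance(chair, table), left_or_right(chair, table))
```

Every downstream relation here is a function of the estimated depth, which is why the load-bearing premise below matters: a biased depth scale shifts both the positions and the measured distance.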

Load-bearing premise

The spatial tools can reliably convert 2D detections into accurate 3D positions and relations without accumulating errors that would undermine the final spatial conclusions.

What would settle it

A benchmark evaluation in which replacing the 3D lifting step with ground-truth 3D measurements produces no accuracy gain over the unaugmented VLM baseline.

Figures

Figures reproduced from arXiv: 2606.20515 by Baoliang Tian, Dingwen Zhang, Fangfu Liu, Fangzhou Hong, Hao Li, Kim-Hui Yap, Runmao Yao, Shulin Tian, Tao Wang, Yalun Dai, Yuhao Dong, Zhaoxi Chen, Ziwei Liu.

**Figure 1.** Overview of S-Agent. S-Agent is the spatial tool-use agentic paradigm designed for continuous multi-view image and video reasoning, which formulates spatial reasoning as an active process of spatio-temporal evidence accumulation. It contains a VLM semantic planner with a hierarchy of spatial tools to ground, lift, and aggregate geometric cues, alongside a dual-memory system to maintain the evolving scene a…

Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories (S-300K) yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S-Agent gives VLMs a planner-plus-tools loop with dual memory to accumulate spatial evidence over frames, plus a new trajectory dataset, but the 3D lifting accuracy is unmeasured.

The paper's main move is to treat spatial reasoning as building evidence across views instead of single-frame guesses. It puts the VLM in charge of planning what to look for, then hands off to a stack of tools that do 2D grounding, 3D lifting, and aggregation, while Scene Memory and Agent Memory track the scene and the reasoning history. That setup is presented as training-free and is also used to generate S-300K trajectories for fine-tuning an 8B model.

The concrete pieces that look new are the explicit tool hierarchy for lifting 2D detections into geometric evidence and the dual-memory design for cross-frame integration. If those components hold up, the approach could be a straightforward way to improve multi-view and video spatial benchmarks on both open and closed models without retraining the base VLM.

The soft spot is exactly where the stress test points: there are no numbers on how accurate the 3D lifting step actually is. Depth scale, occlusion handling, and cross-view consistency are load-bearing, yet the abstract supplies no error rates or ablation on tool fidelity. Without those checks, it is hard to know whether the reported gains come from better evidence or from lucky cases where the tools happen to be right.

The work is aimed at people already building agentic systems for robotics, navigation, or video understanding. It is worth sending to review because the framing is clear, the dataset could be reusable, and the claims are falsifiable once the tool metrics are shown. A referee should focus on the 3D validation experiments and whether the SFT results hold when the tool errors are measured.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces S-Agent, a training-free agentic framework in which a VLM acts as semantic planner while a hierarchy of spatial tools performs 2D grounding, 3D lifting, evidence aggregation, and temporal memory integration (Scene Memory and Agent Memory) to support multi-view and video spatial reasoning. Experiments claim consistent gains for both open- and closed-source VLMs on spatial benchmarks; SFT on the resulting S-300K trajectories produces S-Agent-8B, which surpasses same-scale baselines and approaches advanced closed-source models.

Significance. If the claimed tool hierarchy reliably converts 2D detections into accurate, cross-view-consistent 3D geometric evidence, the work would offer a concrete route to scene-centric rather than frame-centric spatial reasoning and could materially improve VLM performance on counting, measurement, and relative-position tasks without additional training.

major comments (2)

[Experiments (and associated figures/tables)] The central empirical claims rest on the assumption that the 2D-to-3D lifting and aggregation steps produce geometric evidence free of systematic bias. No quantitative evaluation of 3D reconstruction fidelity, depth-scale consistency across views, occlusion handling, or aggregation error rates is supplied; without these measurements it is impossible to determine whether the reported benchmark gains are driven by reliable evidence or by correlated tool errors.
[S-300K construction and SFT experiments] The quality of the S-300K trajectories used for SFT is asserted to be high enough to train a competitive 8B model, yet no human or automatic verification of trajectory correctness (e.g., 3D coordinate accuracy, reasoning-step validity) is reported. This leaves open the possibility that S-Agent-8B simply inherits and amplifies the same unmeasured lifting errors.
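The kind of tool-fidelity measurement the first major comment asks for could be as simple as the sketch below. It is an illustration under stated assumptions, not a metric from the paper: predicted and ground-truth object positions are assumed to be given in metres in the same coordinate frame, keyed by a shared object id, with no ground-truth object exactly at the origin. The function name, the choice of statistics, and the numbers are hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, Tuple

Point3D = Tuple[float, float, float]


def lifting_errors(pred: Dict[str, Point3D], gt: Dict[str, Point3D]) -> dict:
    """Compare lifted 3D positions against ground truth for the shared object ids."""
    shared = sorted(pred.keys() & gt.keys())
    # Per-object Euclidean position error in metres.
    pos_err = [math.dist(pred[k], gt[k]) for k in shared]
    # Ratio of predicted to true range from the origin; a median far from 1.0
    # points to a systematic depth-scale bias rather than random noise.
    scale = [math.hypot(*pred[k]) / math.hypot(*gt[k]) for k in shared]
    return {
        "n": len(shared),
        "mean_pos_err": mean(pos_err),
        "median_scale": median(scale),
    }


# Hypothetical example: every prediction is 10% too far from the camera.
gt = {"chair": (0.0, 0.0, 2.0), "table": (2.0, 0.0, 2.0)}
pred = {"chair": (0.0, 0.0, 2.2), "table": (2.2, 0.0, 2.2)}
report = lifting_errors(pred, gt)
print(report)
```

Reporting a scale statistic alongside the position error separates the two failure modes the referee distinguishes: unbiased noise, which multi-frame aggregation may average out, and correlated bias, which it will not.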

minor comments (1)

[Method overview] Notation for the temporal memory components (Scene Memory vs. Agent Memory) is introduced in the abstract but would benefit from an explicit diagram or pseudocode block showing how evidence is written and read across frames.
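One possible shape for the requested pseudocode is sketched below. It is a guess at the read/write pattern, not the authors' design: the abstract only says that Scene Memory maintains the evolving scene state and Agent Memory accumulates reasoning context. The sketch assumes that object identities are already matched across frames, that positions are expressed in a shared world frame, and that new evidence is merged by a running average. All class, method, and tool names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class SceneMemory:
    """Evolving scene state: a running-average position and a count per object id."""
    objects: Dict[str, Tuple[Point3D, int]] = field(default_factory=dict)

    def write(self, obj_id: str, pos: Point3D) -> None:
        if obj_id not in self.objects:
            self.objects[obj_id] = (pos, 1)
            return
        old, n = self.objects[obj_id]
        # Incremental mean over all observations of this object so far.
        merged = tuple((o * n + p) / (n + 1) for o, p in zip(old, pos))
        self.objects[obj_id] = (merged, n + 1)

    def read(self, obj_id: str) -> Point3D:
        return self.objects[obj_id][0]


@dataclass
class AgentMemory:
    """Reasoning context: ordered log of (frame index, tool name, result summary)."""
    steps: List[Tuple[int, str, str]] = field(default_factory=list)

    def write(self, frame: int, tool: str, summary: str) -> None:
        self.steps.append((frame, tool, summary))

    def read(self, last_k: int = 5) -> List[Tuple[int, str, str]]:
        return self.steps[-last_k:]


scene, agent = SceneMemory(), AgentMemory()
# Hypothetical per-frame tool outputs: (frame, object id, world position).
observations = [
    (0, "chair", (0.0, 0.0, 2.0)),
    (1, "chair", (0.2, 0.0, 2.0)),
    (1, "lamp", (1.0, -0.5, 3.0)),
]
for frame, obj, pos in observations:
    scene.write(obj, pos)
    agent.write(frame, "lift", f"{obj} at {pos}")

print(scene.read("chair"), len(agent.read()))
```

Keeping the two stores separate means the planner can be re-prompted from a short Agent Memory window while the scene state keeps accumulating evidence from every frame.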

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for direct validation of the 3D lifting pipeline and trajectory quality. We address both major comments point-by-point below and will incorporate additional analyses in the revised manuscript.

Referee: [Experiments (and associated figures/tables)] The central empirical claims rest on the assumption that the 2D-to-3D lifting and aggregation steps produce geometric evidence free of systematic bias. No quantitative evaluation of 3D reconstruction fidelity, depth-scale consistency across views, occlusion handling, or aggregation error rates is supplied; without these measurements it is impossible to determine whether the reported benchmark gains are driven by reliable evidence or by correlated tool errors.

Authors: We agree that explicit quantitative metrics on 3D reconstruction fidelity, cross-view consistency, occlusion handling, and aggregation error would strengthen the claims. The manuscript currently relies on downstream benchmark gains as the primary indicator of tool reliability. In the revision we will add a dedicated analysis subsection that reports these metrics on both synthetic multi-view scenes with ground-truth 3D annotations and selected real-world sequences, including error distributions and failure-case breakdowns. This will allow readers to assess whether the observed improvements arise from accurate evidence or systematic biases. revision: yes
Referee: [S-300K construction and SFT experiments] The quality of the S-300K trajectories used for SFT is asserted to be high enough to train a competitive 8B model, yet no human or automatic verification of trajectory correctness (e.g., 3D coordinate accuracy, reasoning-step validity) is reported. This leaves open the possibility that S-Agent-8B simply inherits and amplifies the same unmeasured lifting errors.

Authors: The S-300K trajectories are generated by executing the full S-Agent loop on curated spatial tasks and retaining only those that reach successful termination according to the agent's own verification. While the original submission does not report separate human or automatic correctness audits, the strong held-out performance of the resulting S-Agent-8B provides indirect support. In the revision we will add: (i) explicit filtering statistics and success-rate thresholds used during trajectory collection, (ii) a human-verified subset analysis (approximately 500 trajectories) measuring 3D coordinate accuracy and reasoning-step validity, and (iii) a comparison of per-step error rates before versus after SFT. These additions will directly address the concern of error inheritance. revision: yes
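The collection procedure described in this response (keep self-verified trajectories, then audit a human-checked subset) can be sketched as follows. This only illustrates the described steps under assumptions: the `Trajectory` fields, the function name, and the seeded uniform sampling are hypothetical, and the rebuttal does not state how the audit subset would actually be drawn.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    task_id: str
    self_verified: bool  # hypothetical flag: the agent's own success verdict


def filter_and_sample(trajs: List[Trajectory], audit_size: int,
                      seed: int = 0) -> Tuple[List[Trajectory], float, List[Trajectory]]:
    """Keep self-verified trajectories, report the keep rate, draw an audit subset."""
    kept = [t for t in trajs if t.self_verified]
    keep_rate = len(kept) / len(trajs) if trajs else 0.0
    rng = random.Random(seed)  # fixed seed so the audit subset is reproducible
    audit = rng.sample(kept, min(audit_size, len(kept)))
    return kept, keep_rate, audit


# Hypothetical pool in which every fourth trajectory fails self-verification.
pool = [Trajectory(f"t{i}", i % 4 != 0) for i in range(20)]
kept, keep_rate, audit = filter_and_sample(pool, audit_size=5)
print(len(kept), keep_rate, [t.task_id for t in audit])
```

The keep rate only measures agreement between the agent and its own verifier. The human audit is the step that can detect lifting errors the agent cannot see, which is the inheritance risk the referee raises.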

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks, not self-referential derivations

The paper describes an agentic system (S-Agent) that augments VLMs via a hierarchy of spatial tools, temporal memory, and optional SFT on generated trajectories (S-300K). All central claims are framed as measured improvements on multi-view and video spatial reasoning benchmarks, with comparisons to baselines like Qwen3-VL-8B and closed-source models. No equations, uniqueness theorems, or parameter-fitting steps are presented that reduce a claimed prediction back to the input data or self-citations by construction. The methodology is self-contained against external evaluation; the 3D lifting and aggregation steps are engineering components whose reliability is asserted via downstream task gains rather than tautological redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes the external spatial tools function as black-box oracles.

pith-pipeline@v0.9.1-grok · 5864 in / 1171 out tokens · 30833 ms · 2026-06-30T10:18:59.796774+00:00 · methodology


[1] [1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others. 2022. Flamingo: a visual language model for few-shot le...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, and 35 others. 2023. Rt-2: Vision-language-action models ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. 2025. Benchmark designers should "train on the test set" to expose exploitable non-visual shortcuts.Preprint, arXiv:2511.04655

work page arXiv 2025

[4] [4]

Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631

2020

[5] [5]

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, and 10 others. 2026. Scaling spatial intelligence with multimodal foundation models. Preprint, arXiv:2511.13719

work page arXiv 2026

[6] [6]

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, and Jonathan Tremblay. 2025. Spacetools: Tool-augmented spatial reasoning via double interactive rl. Preprint, arXiv:2512.04069

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng

[8] [8]

Geometrically-constrained agent for spatial reasoning.Preprint, arXiv:2511.22659

work page arXiv

[9] [9]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, and 3 others. 2023. Palm-e: An embodied multimodal language mod...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. 2024. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.Preprint, arXiv:2406.05756

work page arXiv 2024

[11] [11]

BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive.Preprint, arXiv:2404.12390

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? the kitti vision benchmark suite. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361

2012

[13] [13]

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. 2025. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.Preprint, arXiv:2505.21500

work page arXiv 2025

[14] [14]

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. 2025. Spatialladder: Progressive training for spatial reasoning in vision-language models.Preprint, arXiv:2510.08531

work page arXiv 2025

[15] [15]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. 2025. Depth anything 3: Recovering the visual space from any views.Preprint, arXiv:2511.10647

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning.Preprint, arXiv:2304.08485

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

de Melo, and Alan Yuille

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M. de Melo, and Alan Yuille

[18] [18]

3dsrbench: A comprehensive 3d spatial reasoning benchmark.Preprint, arXiv:2412.07825

work page arXiv

[19] [19]

Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. 2025. Visual agentic ai for spatial reasoning with a dynamic api.Preprint, arXiv:2502.06787

work page arXiv 2025

[20] [20]

Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J

Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. Kinectfusion: Real-time dense surface mapping and tracking. In2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE. 14

2011

[21] [21]

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. 2025. Spacer: Reinforcing mllms in video spatial reasoning.Preprint, arXiv:2504.01805

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.Preprint, arXiv:2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021

[23]

Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning. Preprint, arXiv:2303.08128.

[24]

Vishaal Udandarao, Shyamgopal Karthik, Surabhi S Nath, Andreas Hochlehnert, Matthias Bethge, and Ameya Prabhu. 2025. Solving spatial supersensing without spatial supersensing. Preprint, arXiv:2511.16655.

[25]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer. Preprint, arXiv:2503.11651.

[26]

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. 2026. Mindcube: Spatial mental modeling from limited views. Preprint, arXiv:2506.21458.

[27]

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual chatgpt: Talking, drawing and editing with visual foundation models. Preprint, arXiv:2303.04671.

[28]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. Preprint, arXiv:2505.23747.

[29]

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. 2025. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. Preprint, arXiv:2506.09965.

[30]

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. 2025. Thinking in space: How multimodal large language models see, remember, and recall spaces. Preprint, arXiv:2412.14171.

[31]

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. 2025. Visual spatial tuning. Preprint, arXiv:2511.05491.

[32]

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. 2025. Cambrian-s: Towards spatial supersensing in video. Preprint, arXiv:2511.04670.

[33]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. 2025. Mmsi-bench: A benchmark for multi-image spatial intelligence. Preprint, arXiv:2505.23764.

[34]

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. Mm-react: Prompting chatgpt for multimodal reasoning and action. Preprint, arXiv:2303.11381.

[35]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.

[36]

Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, and Manling Li. 2025. T*: Re-thinking temporal search for long-form video understanding. Preprint, arXiv:2504.02259.

[37]

Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. Preprint, arXiv:2306.02858.

[38]

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. 2024. Long context transfer from language to vision. Preprint, arXiv:2406.16852.

[39]

Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, and Angel X. Chang. 2026. Revsi: Rebuilding visual spatial intelligence evaluation for accurate assessment of vlm 3d reasoning. Preprint, arXiv:2604.24300.

[40]

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, and 1 others. 2026. Think3d: Thinking with space for spatial reasoning. Preprint, arXiv:2601.13029.

[41]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguisti...