
arxiv: 2505.17012 · v3 · submitted 2025-05-22 · 💻 cs.CV · cs.AI

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Pith reviewed 2026-05-22 13:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatial intelligence · multimodal large language models · benchmark evaluation · spatial reasoning · multi-agent systems · fine-tuning corpus · vision-language models

The pith

SpatialScore benchmark reveals that current multimodal models lag substantially behind humans in spatial intelligence across 30 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move beyond fragmented tests by creating a single large-scale benchmark that covers many visual inputs, question types, and spatial skills in one place. It evaluates 49 models to document where they fail and then supplies two practical fixes: a large training dataset for fine-tuning and a tool-using agent system that works without retraining. If these claims hold, the work would give the field a shared yardstick plus concrete routes for making vision-language models better at understanding 3D relations, navigation, and manipulation.

Core claim

Existing evaluations of multimodal large language models on spatial intelligence are fragmented and limited; SpatialScore addresses this with roughly 5K manually verified samples that span 30 tasks and cover multiple visual data types, input modalities, and question-answering formats. On this benchmark, the 49 evaluated models show persistent shortfalls and a clear gap to human performance. The authors further supply SpatialCorpus (331K multimodal QA samples), which improves fine-tuned models, and SpatialAgent, a multi-agent system with 12 specialized spatial tools that boosts reasoning through Plan-Execute and ReAct strategies without extra training.

What carries the argument

SpatialScore benchmark, a collection of approximately 5K verified samples spanning 30 tasks that tests multimodal spatial reasoning across varied visual inputs and question formats.
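
A minimal sketch of how a benchmark of this shape could be turned into a per-task comparison against humans. The record fields (`task`, `prediction`, `answer`), the exact-match rule, and the toy task names are assumptions for illustration; the paper's actual data format and metrics are not described on this page.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for records with 'task', 'prediction', 'answer' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(acc_by_task):
    """Unweighted mean over tasks, so small tasks count as much as large ones."""
    return sum(acc_by_task.values()) / len(acc_by_task)


def human_gap(model_acc, human_acc):
    """Per-task gap (human minus model) for tasks present in both dicts."""
    return {t: human_acc[t] - model_acc[t] for t in model_acc if t in human_acc}


# Invented toy records; these are not SpatialScore samples.
records = [
    {"task": "depth_order", "prediction": "A", "answer": "A"},
    {"task": "depth_order", "prediction": "B", "answer": "A"},
    {"task": "object_count", "prediction": "3", "answer": "3"},
]
acc = per_task_accuracy(records)
```

Reporting the gap per task rather than as a single number keeps a strong result on one task from hiding a weak result on another.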

If this is right

  • Evaluating 49 representative MLLMs on SpatialScore documents persistent challenges and a substantial gap to human-level spatial intelligence.
  • Fine-tuning existing models on the 331K-sample SpatialCorpus produces clear performance gains, for example on Qwen3-VL.
  • The training-free SpatialAgent system, equipped with 12 spatial perception tools, delivers substantial reasoning improvements via Plan-Execute and ReAct loops.
  • Releasing the benchmark, corpus, and agent code is intended to provide a shared foundation for future work on human-level spatial intelligence in MLLMs.
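
The two reasoning strategies named above differ mainly in when tool calls are decided. The sketch below illustrates that difference under stated assumptions: the tool names, the `toy_planner` and `toy_policy` stubs (standing in for MLLM calls), and the return format are all hypothetical and are not the paper's 12 tools or its agent implementation.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]

# Hypothetical stand-ins for spatial perception tools.
TOOLS: Dict[str, Tool] = {
    "detect": lambda arg: f"box for {arg}",
    "depth": lambda arg: f"depth of {arg}",
}


def plan_execute(planner, question: str) -> List[str]:
    """Plan-Execute: fix the whole tool sequence up front, then run it in order."""
    plan: List[Tuple[str, str]] = planner(question)
    return [TOOLS[name](arg) for name, arg in plan]


def react(policy, question: str, max_steps: int = 5) -> List[str]:
    """ReAct: choose each action after seeing all earlier observations."""
    observations: List[str] = []
    for _ in range(max_steps):
        name, arg = policy(question, observations)
        if name == "finish":
            break
        observations.append(TOOLS[name](arg))
    return observations


def toy_planner(question: str) -> List[Tuple[str, str]]:
    # Stub for a model call that returns a complete plan.
    return [("detect", "chair"), ("depth", "chair")]


def toy_policy(question: str, observations: List[str]) -> Tuple[str, str]:
    # Stub for a model call that picks the next action from the history.
    if not observations:
        return ("detect", "chair")
    if len(observations) == 1:
        return ("depth", "chair")
    return ("finish", "")
```

With these stubs both loops call the same tools in the same order; the practical difference is that the ReAct policy could change course after an unexpected observation, while the Plan-Execute plan cannot.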

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If SpatialScore is widely adopted, research may shift from narrow isolated tests toward more integrated spatial reasoning problems.
  • The success of the tool-augmented agent suggests that external perception modules could be a scalable way to augment vision-language models on geometry-heavy tasks.
  • Large-scale spatial training data might transfer to related domains such as robotics planning or augmented-reality interfaces.
  • Future benchmarks could add dynamic video or real-robot interaction to test whether gains on static images generalize to changing environments.

Load-bearing premise

The chosen 30 tasks plus manual verification together cover the core skills of spatial intelligence without missing major real-world abilities or introducing bias.

What would settle it

Demonstrating that current models already match human accuracy on an important spatial task that lies outside the 30 categories would undermine the claim that the benchmark is comprehensive.

Original abstract

Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SpatialScore as a new benchmark for multimodal spatial intelligence, containing approximately 5K manually verified samples across 30 tasks that span multiple visual data types, input modalities, and QA formats. It reports an evaluation of 49 MLLMs that reveals persistent gaps relative to human performance, constructs SpatialCorpus (331K multimodal QA samples) to support fine-tuning, and proposes SpatialAgent, a multi-agent system with 12 specialized spatial perception tools that enables training-free gains via Plan-Execute and ReAct reasoning. The central claim is that these resources together provide the most comprehensive evaluation suite to date and practical routes toward closing the human-model gap.

Significance. If the 30 tasks and verification process adequately cover essential spatial skills without systematic bias or omission, the released benchmark, corpus, code, and models would constitute a valuable community resource for standardizing evaluation and driving progress in MLLM spatial reasoning. The dual data-driven and agent-based improvement strategies, together with the scale of the model evaluation, add practical utility beyond pure benchmarking.

major comments (1)
  1. §3 (Benchmark Construction and Task Selection): The claim that SpatialScore is the most comprehensive benchmark rests on the 30 tasks plus manual verification together covering essential spatial intelligence components without bias or major gaps. The manuscript presents the tasks and data sources but supplies no formal taxonomy of spatial abilities (e.g., static vs. dynamic, egocentric vs. allocentric, 2D vs. 3D mental manipulation) nor a coverage analysis showing how the chosen tasks map onto such dimensions. This leaves open the possibility that entire classes of real-world spatial reasoning are absent or underrepresented.
minor comments (2)
  1. Abstract: The phrase 'approximately 5K' should be replaced by the exact sample count for precision and reproducibility.
  2. §5 (SpatialAgent): A summary table listing the 12 specialized tools, their inputs/outputs, and intended spatial sub-skill would improve clarity and allow readers to assess coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding benchmark construction and task selection below, and we will revise the paper to incorporate a formal taxonomy and coverage analysis as suggested.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction and Task Selection): The claim that SpatialScore is the most comprehensive benchmark rests on the 30 tasks plus manual verification together covering essential spatial intelligence components without bias or major gaps. The manuscript presents the tasks and data sources but supplies no formal taxonomy of spatial abilities (e.g., static vs. dynamic, egocentric vs. allocentric, 2D vs. 3D mental manipulation) nor a coverage analysis showing how the chosen tasks map onto such dimensions. This leaves open the possibility that entire classes of real-world spatial reasoning are absent or underrepresented.

    Authors: We thank the referee for this valuable observation. Our 30 tasks were selected after reviewing prior spatial reasoning benchmarks and cognitive science literature to span diverse visual data types, modalities, and formats, including elements of object localization, spatial relations, mental rotation, navigation, and 3D understanding. Nevertheless, we acknowledge that the original manuscript did not include an explicit formal taxonomy or systematic coverage mapping. In the revised version, we will add a dedicated subsection to §3 that introduces a taxonomy of spatial abilities (categorizing along axes such as static vs. dynamic, egocentric vs. allocentric, and 2D vs. 3D mental manipulation, drawing from established frameworks in the field) and provides a table mapping each of the 30 tasks to these dimensions, along with a brief discussion of coverage and potential gaps. This addition will be grounded in the task descriptions and data sources already presented, thereby strengthening the justification for the comprehensiveness claim without altering the core benchmark content.

    Revision: yes
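
The coverage mapping discussed in this exchange can be sketched as a small table-building routine. The three axes come from the referee's comment; the task names and their labels are placeholders invented for illustration, not a labeling of the actual 30 tasks.

```python
from itertools import product

# Axes taken from the referee's examples.
AXES = {
    "dynamics": ("static", "dynamic"),
    "frame": ("egocentric", "allocentric"),
    "dimension": ("2D", "3D"),
}

# Hypothetical task labels; a real analysis would label every benchmark task.
TASK_LABELS = {
    "task_a": {"dynamics": "static", "frame": "egocentric", "dimension": "2D"},
    "task_b": {"dynamics": "static", "frame": "allocentric", "dimension": "3D"},
    "task_c": {"dynamics": "dynamic", "frame": "egocentric", "dimension": "3D"},
}


def coverage(task_labels, axes):
    """Map every combination of axis values to the list of tasks that fall in it."""
    names = list(axes)
    cells = {combo: [] for combo in product(*(axes[n] for n in names))}
    for task, labels in task_labels.items():
        cells[tuple(labels[n] for n in names)].append(task)
    return cells


def uncovered(cells):
    """Return the axis-value combinations that no task covers."""
    return sorted(combo for combo, tasks in cells.items() if not tasks)


cells = coverage(TASK_LABELS, AXES)
gaps = uncovered(cells)
```

Empty cells are the concrete form of the referee's concern: each one names a class of spatial reasoning that the labeled tasks do not exercise.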

Circularity Check

0 steps flagged

No circularity: empirical benchmark and resource creation paper

Full rationale

The manuscript introduces SpatialScore (approximately 5K manually verified samples across 30 tasks), SpatialCorpus (331K multimodal QA samples), and SpatialAgent (multi-agent system with 12 tools) as new artifacts, then reports evaluations of 49 MLLMs. No derivation chain, equations, fitted parameters, or first-principles predictions are claimed; the central assertions rest on the construction process, manual verification, and experimental outcomes rather than any reduction to self-defined inputs or self-citations. The work is self-contained as resource creation and empirical assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that manually verified QA pairs constitute a valid proxy for spatial intelligence and that the chosen 30 tasks are representative; no new physical constants or fitted parameters are introduced in the abstract.

axioms (1)
  • Domain assumption: Human performance on the SpatialScore tasks constitutes the appropriate upper bound for model comparison.
    Stated in the abstract when reporting the gap between current models and human-level spatial intelligence.




Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

  2. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  3. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  4. World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...

  5. MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

    cs.LG 2026-02 conditional novelty 6.0

    MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

  6. MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

    cs.LG 2026-02 unverdicted novelty 6.0

    MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

  7. Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    cs.CV 2025-05 unverdicted novelty 6.0

    Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

  8. Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

    cs.AI 2025-09 unverdicted novelty 5.0

    MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...

  9. Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

    cs.AI 2025-10 unverdicted novelty 4.0

    A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...

Reference graph

Works this paper leans on

155 extracted references · 155 canonical work pages · cited by 8 Pith papers · 14 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  3. [3]

    Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024

    Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024

  4. [4]

    System card: Claude sonnet 4.5, 2025

    Anthropic. System card: Claude sonnet 4.5, 2025

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  7. [7]

    Spatialthinker: Reinforcing 3d reasoning in multimodal llms via spatial rewards.arXiv preprint arXiv:2511.07403, 2025

    Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, and Ronald Clark. Spatialthinker: Reinforcing 3d reasoning in multimodal llms via spatial rewards.arXiv preprint arXiv:2511.07403, 2025

  8. [8]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023

  9. [9]

    Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

    Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

  10. [10]

    Spatialbot: Precise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InIEEE International Conference on Robotics and Automation, 2025

  11. [11]

    Scaling spatial intelligence with multimodal foundation models

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 13

  12. [12]

    Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

    Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, et al. Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

  13. [13]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

  14. [14]

    Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InProceedings of the International Conference on Learning Representations, 2024

  15. [15]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  16. [16]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InConference on Neural Information Processing Systems, 2025

  17. [17]

    Physbench: Benchmarking and enhancing vision-language models for physical world understanding

    Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world understanding. InProceedings of the International Conference on Learning Representations, 2025

  18. [18]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  19. [19]

    Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model, 2024

    X.AI Corp. Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model, 2024

  20. [20]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

  21. [21]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. InProceedings of the International Conference on Learning Representations, 2024

  22. [22]

    Mm-spatial: Exploring 3d spatial understanding in multimodal llms

    Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. InProceedings of the International Conference on Computer Vision, 2025

  23. [23]

    Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026

  24. [24]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

  25. [25]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InProceedings of the European Conference on Computer Vision, 2024

  26. [26]

    Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 2024

    Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 2024

  27. [27]

    Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence

    Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, and Rongrong Ji. Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence. InProceedings of the International Conference on Learning Representations, 2026

  28. [28]

    A new era of intelligence with gemini 3, 2025

    Google. A new era of intelligence with gemini 3, 2025

  29. [29]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 14

  30. [30]

    Embodied llm agents learn to cooperate in organized teams.arXiv preprint arXiv:2403.12482, 2024

    Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L Griffiths, and Mengdi Wang. Embodied llm agents learn to cooperate in organized teams.arXiv preprint arXiv:2403.12482, 2024

  31. [31]

    Cambridge university press, 2003

    Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision, volume 665. Cambridge university press, 2003

  32. [32]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InProceedings of the International Conference on Learning Representations, 2024

  33. [33]

    Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InProceedings of the International Conference on Learning Representations, 2026

  34. [34]

    Detect anything via next point prediction

    Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026

  35. [35]

    What’s “up” with vision-language models? investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2023

  36. [36]

    Mapanything: Universal feed-forward metric 3d reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. InInternational Conference on 3D Vision, 2026

  37. [37]

    Cubify anything: Scaling indoor 3d object detection

    Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, and Afshin Dehghan. Cubify anything: Scaling indoor 3d object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

  38. [38]

    LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025

  39. [39]

    Seed-bench: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

  40. [40]

    Camel: Communicative agents for "mind" exploration of large language model society

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. InConference on Neural Information Processing Systems, 2023

  41. [41]

    Agent hospital: A simulacrum of hospital with evolvable medical agents,

    Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents.arXiv preprint arXiv:2405.02957, 2024

  42. [42]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the International Conference on Computer Vision, 2025

    Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the International Conference on Computer Vision, 2025

  43. [43]

    Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks

    Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks. InConference on Neural Information Processing Systems, 2024

  44. [44]

    Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models

    Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

  45. [45]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

  46. [46]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 15

  47. [47]

    Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence

    Chonghan Liu, Haoran Wang, Felix Henry, Pu Miao, Yajie Zhang, Yu Zhao, and Peiran Wu. Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence. InConference on Neural Information Processing Systems, 2025

  48. [48]

    Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 2023

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 2023

  49. [49]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

  50. [50]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InConference on Neural Information Processing Systems, 2023

  51. [51]

    Versavit: Enhancing mllm vision backbones via task-guided optimization.arXiv preprint arXiv:2602.09934, 2026

    Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, et al. Versavit: Enhancing mllm vision backbones via task-guided optimization.arXiv preprint arXiv:2602.09934, 2026

  52. [52]

    Lamra: Large multimodal model as your advanced retrieval assistant

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

  53. [53]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, 2024

  54. [54]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004

  55. [55]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proceedings of the International Conference on Learning Representations, 2024

  56. [56]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, and Jieneng Chen. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In Proceedings of the International Conference on Computer Vision, 2025

  57. [57]

    Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning

    Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. In Conference on Neural Information Processing Systems, 2025

  58. [58]

    Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

    Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

  59. [59]

    Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action

    Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, et al. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action. arXiv preprint arXiv:2412.05479, 2024

  60. [60]

    Mmiu: Multimodal multi-image understanding for evaluating large vision-language models

    Fanqing Meng, Chuanhao Li, Jin Wang, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, et al. Mmiu: Multimodal multi-image understanding for evaluating large vision-language models. In Proceedings of the International Conference on Learning Representations, 2025

  61. [61]

    Scenegen: Single-image 3d scene generation in one feedforward pass

    Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass. In International Conference on 3D Vision, 2026

  62. [62]

    GPT-5 System Card

    OpenAI. GPT-5 System Card, 2025. Accessed: 2025-11-1

  63. [63]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025

  64. [64]

    Chatdev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Association for Computational Linguistics, 2024

  65. [65]

    Multi-agent system for comprehensive soccer understanding

    Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Multi-agent system for comprehensive soccer understanding. In ACM Multimedia, 2025

  66. [66]

    Towards universal soccer video understanding

    Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

  67. [67]

    Matchtime: Towards automatic soccer game commentary generation

    Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, and Weidi Xie. Matchtime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

  68. [68]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations, 2025

  69. [69]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  70. [70]

    Satori-r1: Incentivizing multimodal reasoning through explicit visual anchoring

    Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori-r1: Incentivizing multimodal reasoning through explicit visual anchoring. arXiv preprint arXiv:2505.19094, 2025

  71. [71]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Conference on Neural Information Processing Systems, 2023

  72. [72]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

  73. [73]

    Mind the gap: Benchmarking spatial reasoning in vision-language models

    Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

  74. [74]

    Multi-agent embodied question answering in interactive environments

    Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, and Fuchun Sun. Multi-agent embodied question answering in interactive environments. In Proceedings of the European Conference on Computer Vision, 2020

  75. [75]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  76. [76]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  77. [77]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, 2020

  78. [78]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In Conference on Neural Information Processing Systems, 2024

  79. [79]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

  80. [80]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
