SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Pith reviewed 2026-05-22 13:03 UTC · model grok-4.3
The pith
SpatialScore benchmark reveals that current multimodal models lag substantially behind humans in spatial intelligence across 30 tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing evaluations of multimodal large language models on spatial intelligence are fragmented and limited in scope; SpatialScore addresses this with roughly 5K manually verified samples spanning 30 tasks and multiple visual data types, input modalities, and question-answering formats. Evaluated on the benchmark, 49 models show persistent shortfalls and a clear gap to human performance. The authors further supply SpatialCorpus (331K multimodal QA samples), which improves fine-tuned models, and SpatialAgent, a multi-agent system with 12 specialized spatial tools that boosts reasoning through Plan-Execute and ReAct strategies without extra training.
What carries the argument
SpatialScore benchmark, a collection of approximately 5K verified samples spanning 30 tasks that tests multimodal spatial reasoning across varied visual inputs and question formats.
If this is right
- Evaluating 49 representative MLLMs on SpatialScore documents persistent challenges and a substantial gap to human-level spatial intelligence.
- Fine-tuning existing models on the 331K-sample SpatialCorpus produces clear performance gains, for example on Qwen3-VL.
- The training-free SpatialAgent system, equipped with 12 spatial perception tools, delivers substantial reasoning improvements via Plan-Execute and ReAct loops.
- Releasing the benchmark, corpus, and agent code is intended to provide a shared foundation for future work on human-level spatial intelligence in MLLMs.
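The Plan-Execute and ReAct strategies mentioned above differ mainly in when tool choices are made: up front, or one step at a time after each observation. The following is a toy sketch of that difference, not the authors' SpatialAgent. The two tools (`estimate_depth`, `compare_depth`), the `toy_policy` function, and the scene dictionary are hypothetical stand-ins, since this summary does not list the paper's 12 tools.

```python
from typing import Callable, Dict, List

Tool = Callable[[dict], dict]

def estimate_depth(state: dict) -> dict:
    # Hypothetical tool: pretends each object already carries a depth in metres.
    return {"depth": {name: obj["z"] for name, obj in state["objects"].items()}}

def compare_depth(state: dict) -> dict:
    # Hypothetical tool: returns the name of the closer of the two queried objects.
    first, second = state["query"]
    depth = state["depth"]
    return {"answer": first if depth[first] < depth[second] else second}

TOOLS: Dict[str, Tool] = {
    "estimate_depth": estimate_depth,
    "compare_depth": compare_depth,
}

def plan_execute(state: dict, plan: List[str]) -> str:
    """Plan-Execute: fix the whole tool sequence first, then run it in order."""
    state = dict(state)
    for name in plan:
        state.update(TOOLS[name](state))
    return state["answer"]

def react(state: dict, policy: Callable[[dict], str], max_steps: int = 5) -> str:
    """ReAct: pick the next tool only after observing the previous result."""
    state = dict(state)
    for _ in range(max_steps):
        action = policy(state)
        if action == "finish":
            return state["answer"]
        state.update(TOOLS[action](state))
    raise RuntimeError("step budget exhausted before an answer was produced")

def toy_policy(state: dict) -> str:
    # Stand-in for the model's reasoning step (an assumption, not the paper's prompt).
    if "depth" not in state:
        return "estimate_depth"
    if "answer" not in state:
        return "compare_depth"
    return "finish"

# Illustrative scene: which of the two objects is closer to the camera?
scene = {
    "objects": {"chair": {"z": 2.5}, "lamp": {"z": 4.0}},
    "query": ("chair", "lamp"),
}
plan_answer = plan_execute(scene, ["estimate_depth", "compare_depth"])
react_answer = react(scene, toy_policy)
```

On this toy scene both strategies reach the same answer; the practical difference would show up when an intermediate tool result changes which tool is worth calling next, which only the ReAct loop can react to.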
Where Pith is reading between the lines
- If SpatialScore is widely adopted, research may shift from narrow isolated tests toward more integrated spatial reasoning problems.
- The success of the tool-augmented agent suggests that external perception modules could be a scalable way to augment vision-language models on geometry-heavy tasks.
- Large-scale spatial training data might transfer to related domains such as robotics planning or augmented-reality interfaces.
- Future benchmarks could add dynamic video or real-robot interaction to test whether gains on static images generalize to changing environments.
Load-bearing premise
The chosen 30 tasks plus manual verification together cover the core skills of spatial intelligence without missing major real-world abilities or introducing bias.
What would settle it
Demonstrating that current models already match human accuracy on an important spatial task that lies outside the 30 categories would undermine the claim that the benchmark is comprehensive.
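A per-task comparison of model and human accuracy is the kind of check this test implies. The sketch below shows one way it could be computed. The function names are hypothetical, and the accuracy values are made-up placeholders in percentage points, not results from the paper; only the category names are taken from the benchmark description.

```python
from statistics import mean
from typing import Dict, List

def per_task_gap(model_acc: Dict[str, int], human_acc: Dict[str, int]) -> Dict[str, int]:
    """Human-minus-model accuracy (percentage points) for tasks present in both."""
    shared = sorted(set(model_acc) & set(human_acc))
    return {task: human_acc[task] - model_acc[task] for task in shared}

def closed_tasks(gaps: Dict[str, int], tolerance: int = 0) -> List[str]:
    """Tasks where the model is within `tolerance` points of human accuracy."""
    return [task for task, gap in gaps.items() if gap <= tolerance]

# Placeholder numbers for illustration only; they are NOT reported results.
human = {"counting": 95, "depth estimation": 90, "object size": 85}
model = {"counting": 70, "depth estimation": 90, "object size": 60}

gaps = per_task_gap(model, human)
macro_gap = mean(gaps.values())   # unweighted average gap across tasks
matched = closed_tasks(gaps)      # tasks with no remaining human-model gap
```

A macro average treats every task equally regardless of sample count, which is an assumption of this sketch; the paper's own aggregation may differ.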
Original abstract
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SpatialScore as a new benchmark for multimodal spatial intelligence, containing approximately 5K manually verified samples across 30 tasks that span multiple visual data types, input modalities, and QA formats. It reports an evaluation of 49 MLLMs that reveals persistent gaps relative to human performance, constructs SpatialCorpus (331K multimodal QA samples) to support fine-tuning, and proposes SpatialAgent, a multi-agent system with 12 specialized spatial perception tools that enables training-free gains via Plan-Execute and ReAct reasoning. The central claim is that these resources together provide the most comprehensive evaluation suite to date and practical routes toward closing the human-model gap.
Significance. If the 30 tasks and verification process adequately cover essential spatial skills without systematic bias or omission, the released benchmark, corpus, code, and models would constitute a valuable community resource for standardizing evaluation and driving progress in MLLM spatial reasoning. The dual data-driven and agent-based improvement strategies, together with the scale of the model evaluation, add practical utility beyond pure benchmarking.
major comments (1)
- §3 (Benchmark Construction and Task Selection): The claim that SpatialScore is the most comprehensive benchmark rests on the 30 tasks plus manual verification together covering essential spatial intelligence components without bias or major gaps. The manuscript presents the tasks and data sources but supplies no formal taxonomy of spatial abilities (e.g., static vs. dynamic, egocentric vs. allocentric, 2D vs. 3D mental manipulation) nor a coverage analysis showing how the chosen tasks map onto such dimensions. This leaves open the possibility that entire classes of real-world spatial reasoning are absent or underrepresented.
minor comments (2)
- Abstract: The phrase 'approximately 5K' should be replaced by the exact sample count for precision and reproducibility.
- §5 (SpatialAgent): A summary table listing the 12 specialized tools, their inputs/outputs, and intended spatial sub-skill would improve clarity and allow readers to assess coverage.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding benchmark construction and task selection below, and we will revise the paper to incorporate a formal taxonomy and coverage analysis as suggested.
Point-by-point responses
- Referee: §3 (Benchmark Construction and Task Selection): The claim that SpatialScore is the most comprehensive benchmark rests on the 30 tasks plus manual verification together covering essential spatial intelligence components without bias or major gaps. The manuscript presents the tasks and data sources but supplies no formal taxonomy of spatial abilities (e.g., static vs. dynamic, egocentric vs. allocentric, 2D vs. 3D mental manipulation) nor a coverage analysis showing how the chosen tasks map onto such dimensions. This leaves open the possibility that entire classes of real-world spatial reasoning are absent or underrepresented.
  Authors: We thank the referee for this valuable observation. Our 30 tasks were selected after reviewing prior spatial reasoning benchmarks and cognitive science literature to span diverse visual data types, modalities, and formats, including elements of object localization, spatial relations, mental rotation, navigation, and 3D understanding. Nevertheless, we acknowledge that the original manuscript did not include an explicit formal taxonomy or systematic coverage mapping. In the revised version, we will add a dedicated subsection to §3 that introduces a taxonomy of spatial abilities (categorizing along axes such as static vs. dynamic, egocentric vs. allocentric, and 2D vs. 3D mental manipulation, drawing from established frameworks in the field) and provides a table mapping each of the 30 tasks to these dimensions, along with a brief discussion of coverage and potential gaps. This addition will be grounded in the task descriptions and data sources already presented, thereby strengthening the justification for the comprehensiveness claim without altering the core benchmark content.
  revision: yes
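The coverage mapping discussed in this exchange can be sketched as a small cross-tabulation: label each task along the three axes the referee names, then list the axis combinations that no task occupies. The axis names come from the referee's comment; the label assignments below are hypothetical illustrations for four of the benchmark's categories, not the authors' mapping.

```python
from itertools import product
from typing import Dict, List, Tuple

# Axes named in the referee comment; the dictionary keys are labels chosen for this sketch.
AXES: Dict[str, Tuple[str, str]] = {
    "dynamics": ("static", "dynamic"),
    "frame": ("egocentric", "allocentric"),
    "dimension": ("2D", "3D"),
}

# Hypothetical labels for illustration only; the paper provides no such table yet.
TASK_LABELS: Dict[str, Dict[str, str]] = {
    "counting": {"dynamics": "static", "frame": "egocentric", "dimension": "2D"},
    "depth estimation": {"dynamics": "static", "frame": "egocentric", "dimension": "3D"},
    "object motion": {"dynamics": "dynamic", "frame": "allocentric", "dimension": "3D"},
    "camera pose & motion": {"dynamics": "dynamic", "frame": "egocentric", "dimension": "3D"},
}

def coverage(task_labels: Dict[str, Dict[str, str]],
             axes: Dict[str, Tuple[str, str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Map every combination of axis values to the tasks that fall in it."""
    cells: Dict[Tuple[str, ...], List[str]] = {
        combo: [] for combo in product(*axes.values())
    }
    for task, labels in task_labels.items():
        key = tuple(labels[axis] for axis in axes)
        cells[key].append(task)
    return cells

def uncovered(cells: Dict[Tuple[str, ...], List[str]]) -> List[Tuple[str, ...]]:
    """Axis combinations that no task covers, in sorted order."""
    return sorted(combo for combo, tasks in cells.items() if not tasks)

cells = coverage(TASK_LABELS, AXES)
missing = uncovered(cells)
```

With all 30 tasks labeled, the `missing` list would directly expose the kind of absent or underrepresented classes the referee is concerned about, though a single label per axis is a simplification for tasks that mix, say, egocentric and allocentric reasoning.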
Circularity Check
No circularity: empirical benchmark and resource creation paper
The manuscript introduces SpatialScore (5K manually verified samples across 30 tasks), SpatialCorpus (331K QA pairs), and SpatialAgent (multi-agent system with 12 tools) as new artifacts, then reports evaluations of 49 MLLMs. No derivation chain, equations, fitted parameters, or first-principles predictions are claimed; the central assertions rest on the construction process, manual verification, and experimental outcomes rather than any reduction to self-defined inputs or self-citations. The work is self-contained as resource creation and empirical assessment.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human performance on the SpatialScore tasks constitutes the appropriate upper bound for model comparison.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: SpatialScore ... spanning 30 distinct tasks ... grouped ... into 10 intuitive categories: mental animation, counting, depth estimation, object distance, object motion, camera pose & motion, temporal reasoning, view reasoning, object size, and object localization.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples ... SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 9 Pith papers
- CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
  Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
- SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
  SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.
- Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
  A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
- World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
  Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
  MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
  MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
- Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
  Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.
- Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
  MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...
- Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
  A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...
Reference graph
Works this paper leans on
-
[1]
Cosmos World Foundation Model Platform for Physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024
Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024
work page 2024
- [4]
-
[5]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, and Ronald Clark. Spatialthinker: Reinforcing 3d reasoning in multimodal llms via spatial rewards.arXiv preprint arXiv:2511.07403, 2025
-
[8]
Omni3d: A large benchmark and model for 3d object detection in the wild
Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023
work page 2023
-
[9]
Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025
-
[10]
Spatialbot: Precise spatial understanding with vision language models
Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InIEEE International Conference on Robotics and Automation, 2025
work page 2025
-
[11]
Scaling spatial intelligence with multimodal foundation models
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 13
-
[12]
Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025
Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, et al. Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025
-
[13]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[14]
Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InProceedings of the International Conference on Learning Representations, 2024
work page 2024
-
[15]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Spatialrgpt: Grounded spatial reasoning in vision-language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InConference on Neural Information Processing Systems, 2025
work page 2025
-
[17]
Physbench: Benchmarking and enhancing vision-language models for physical world understanding
Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world understanding. InProceedings of the International Conference on Learning Representations, 2025
work page 2025
-
[18]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
X.AI Corp. Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model, 2024
work page 2024
-
[20]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017
work page 2017
-
[21]
FlashAttention-2: Faster attention with better parallelism and work partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. InProceedings of the International Conference on Learning Representations, 2024
work page 2024
-
[22]
Mm-spatial: Exploring 3d spatial understanding in multimodal llms
Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. InProceedings of the International Conference on Computer Vision, 2025
work page 2025
-
[23]
Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026
work page 2026
-
[24]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[25]
Blink: Multimodal large language models can see but not perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InProceedings of the European Conference on Computer Vision, 2024
work page 2024
-
[26]
Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 2024
work page 2024
-
[27]
Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, and Rongrong Ji. Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence. InProceedings of the International Conference on Learning Representations, 2026
work page 2026
-
[28]
A new era of intelligence with gemini 3, 2025
Google. A new era of intelligence with gemini 3, 2025
work page 2025
-
[29]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Embodied llm agents learn to cooperate in organized teams.arXiv preprint arXiv:2403.12482, 2024
Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L Griffiths, and Mengdi Wang. Embodied llm agents learn to cooperate in organized teams.arXiv preprint arXiv:2403.12482, 2024
-
[31]
Cambridge university press, 2003
Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision, volume 665. Cambridge university press, 2003
work page 2003
-
[32]
Metagpt: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InProceedings of the International Conference on Learning Representations, 2024
work page 2024
-
[33]
Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InProceedings of the International Conference on Learning Representations, 2026
work page 2026
-
[34]
Detect anything via next point prediction
Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026
work page 2026
-
[35]
What’s “up” with vision-language models? investigating their struggle with spatial reasoning
Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2023
work page 2023
-
[36]
Mapanything: Universal feed-forward metric 3d reconstruction
Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. InInternational Conference on 3D Vision, 2026
work page 2026
-
[37]
Cubify anything: Scaling indoor 3d object detection
Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, and Afshin Dehghan. Cubify anything: Scaling indoor 3d object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[38]
LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025
work page 2025
-
[39]
Seed-bench: Benchmarking multimodal large language models
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[40]
Camel: Communicative agents for "mind" exploration of large language model society
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. InConference on Neural Information Processing Systems, 2023
work page 2023
-
[41]
Agent hospital: A simulacrum of hospital with evolvable medical agents,
Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents.arXiv preprint arXiv:2405.02957, 2024
-
[42]
Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the International Conference on Computer Vision, 2025
work page 2025
-
[43]
Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks. InConference on Neural Information Processing Systems, 2024
work page 2024
-
[44]
Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2024
work page 2024
-
[45]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[46]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence
Chonghan Liu, Haoran Wang, Felix Henry, Pu Miao, Yajie Zhang, Yu Zhao, and Peiran Wu. Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence. InConference on Neural Information Processing Systems, 2025
work page 2025
-
[48]
Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 2023
Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 2023
work page 2023
-
[49]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[50]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InConference on Neural Information Processing Systems, 2023
work page 2023
-
[51]
Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, et al. Versavit: Enhancing mllm vision backbones via task-guided optimization.arXiv preprint arXiv:2602.09934, 2026
-
[52]
Lamra: Large multimodal model as your advanced retrieval assistant
Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[53]
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InProceedings of the European Conference on Computer Vision, 2024
work page 2024
-
[54]
David G Lowe. Distinctive image features from scale-invariant keypoints.International Journal of Computer Vision, 2004
work page 2004
-
[55]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InProceedings of the International Conference on Learning Representations, 2024
work page 2024
-
[56]
3dsrbench: A comprehensive 3d spatial reasoning benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, and Jieneng Chen. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the International Conference on Computer Vision, 2025
work page 2025
-
[57]
Spatialrea- soner: Towards explicit and generalizable 3d spatial reasoning
Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialrea- soner: Towards explicit and generalizable 3d spatial reasoning. InConference on Neural Information Processing Systems, 2025
work page 2025
-
[58]
Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models
Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[59]
Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Niebles, et al. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action.arXiv preprint arXiv:2412.05479, 2024
-
[60]
Mmiu: Multimodal multi-image understanding for evaluating large vision-language models
Fanqing Meng, Chuanhao Li, Jin Wang, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, et al. Mmiu: Multimodal multi-image understanding for evaluating large vision-language models. InProceedings of the International Conference on Learning Representations, 2025
work page 2025
-
[61]
Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass. In International Conference on 3D Vision, 2026
- [62]
-
[63]
Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025
-
[64]
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Association for Computational Linguistics, 2024
-
[65]
Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Multi-agent system for comprehensive soccer understanding. In ACM Multimedia, 2025
-
[66]
Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
-
[67]
Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, and Weidi Xie. Matchtime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024
-
[68]
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations, 2025
-
[69]
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016
-
[70]
Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori-r1: Incentivizing multimodal reasoning through explicit visual anchoring. arXiv preprint arXiv:2505.19094, 2025
-
[71]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Conference on Neural Information Processing Systems, 2023
-
[72]
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025
-
[73]
Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025
-
[74]
Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, and Fuchun Sun. Multi-agent embodied question answering in interactive environments. In Proceedings of the European Conference on Computer Vision, 2020
-
[75]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
-
[76]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025
-
[77]
Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, 2020
-
[78]
Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In Conference on Neural Information Processing Systems, 2024
-
[79]
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024
-
[80]
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025