SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Haoning Wu; Weidi Xie; Xiao Huang; Yanfeng Wang; Yaohui Chen; Ya Zhang

arxiv: 2505.17012 · v3 · submitted 2025-05-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 13:03 UTC · model grok-4.3

keywords spatial intelligence · multimodal large language models · benchmark evaluation · spatial reasoning · multi-agent systems · fine-tuning corpus · vision-language models

The pith

SpatialScore benchmark reveals that current multimodal models lag substantially behind humans in spatial intelligence across 30 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move beyond fragmented tests by creating a single large-scale benchmark that covers many visual inputs, question types, and spatial skills in one place. It evaluates dozens of models to document where they fail and then supplies two practical fixes: a large training dataset for fine-tuning and a tool-using agent system that works without retraining. If these claims hold, the work would give the field a shared yardstick plus concrete routes for making vision-language models better at understanding 3D relations, navigation, and manipulation.

Core claim

Existing evaluations of multimodal large language models on spatial intelligence are fragmented and limited; SpatialScore addresses this with roughly 5K manually verified samples across 30 tasks, multiple visual data types, and input modalities. Using the benchmark, 49 models show persistent shortfalls and a clear gap to human performance. The authors further supply SpatialCorpus (331K multimodal QA samples) that improves fine-tuned models and SpatialAgent, a multi-agent system with 12 specialized spatial tools that boosts reasoning through Plan-Execute and ReAct strategies without extra training.

What carries the argument

SpatialScore benchmark, a collection of approximately 5K verified samples spanning 30 tasks that tests multimodal spatial reasoning across varied visual inputs and question formats.

If this is right

Evaluating 49 representative MLLMs on SpatialScore documents persistent challenges and a substantial gap to human-level spatial intelligence.
Fine-tuning existing models on the 331K-sample SpatialCorpus produces clear performance gains, for example on Qwen3-VL.
The training-free SpatialAgent system, equipped with 12 spatial perception tools, delivers substantial reasoning improvements via Plan-Execute and ReAct loops.
Releasing the benchmark, corpus, and agent code is intended to provide a shared foundation for future work on human-level spatial intelligence in MLLMs.
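
The difference between the two reasoning strategies named above can be shown with a minimal sketch. It is not the authors' implementation: the two toy tools, the `TOOLS` registry, and `toy_policy` (a stand-in for the LLM's reasoning step) are all hypothetical, and the depth values are placeholders. Plan-Execute fixes the tool sequence up front, while ReAct chooses each action after seeing the previous observation.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for spatial perception tools; the paper's 12 tools are not listed here.
Tool = Callable[[dict], dict]

def estimate_depth(state: dict) -> dict:
    # Toy tool: returns placeholder depths (in meters) for two objects.
    return {"depth": {"chair": 2.0, "table": 3.5}}

def compare_depth(state: dict) -> dict:
    # Toy tool: picks the object with the smallest depth value.
    depth = state["depth"]
    return {"answer": min(depth, key=depth.get)}

TOOLS: Dict[str, Tool] = {
    "estimate_depth": estimate_depth,
    "compare_depth": compare_depth,
}

def plan_execute(plan: List[str], tools: Dict[str, Tool] = TOOLS) -> dict:
    """Plan-Execute: the full tool sequence is decided before any tool runs."""
    state: dict = {}
    for name in plan:
        state.update(tools[name](state))
    return state

def react(policy: Callable[[dict], str],
          tools: Dict[str, Tool] = TOOLS,
          max_steps: int = 5) -> dict:
    """ReAct: the next action is chosen after each observation."""
    state: dict = {}
    for _ in range(max_steps):
        action = policy(state)
        if action == "finish":
            break
        state.update(tools[action](state))
    return state

def toy_policy(state: dict) -> str:
    # Hypothetical stand-in for the model's reasoning step.
    if "depth" not in state:
        return "estimate_depth"
    if "answer" not in state:
        return "compare_depth"
    return "finish"

result_pe = plan_execute(["estimate_depth", "compare_depth"])
result_react = react(toy_policy)
```

Both loops reach the same answer on this toy question. They would diverge when a tool returns something unexpected, since only the ReAct loop can change course mid-run.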

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If SpatialScore is widely adopted, research may shift from narrow isolated tests toward more integrated spatial reasoning problems.
The success of the tool-augmented agent suggests that external perception modules could be a scalable way to augment vision-language models on geometry-heavy tasks.
Large-scale spatial training data might transfer to related domains such as robotics planning or augmented-reality interfaces.
Future benchmarks could add dynamic video or real-robot interaction to test whether gains on static images generalize to changing environments.

Load-bearing premise

The chosen 30 tasks plus manual verification together cover the core skills of spatial intelligence without missing major real-world abilities or introducing bias.

What would settle it

Demonstrating that current models already match human accuracy on an important spatial task that lies outside the 30 categories would undermine the claim that the benchmark is comprehensive.

Original abstract

Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a larger benchmark plus usable training data and an agent for spatial reasoning in MLLMs, but the 30 tasks lack a clear coverage map.

The main point is that SpatialScore gives the field a bigger, more varied test set for spatial intelligence in multimodal models, paired with a training corpus and a tool-based agent that both produce measurable gains on the same tasks. The authors evaluate 49 models, document a clear human-model gap, then show that fine-tuning on the 331K-sample SpatialCorpus lifts results and that SpatialAgent, with its 12 perception tools, does the same without training. The manual verification of the roughly 5K samples and the planned release of data, code, and models are the practical parts that stand out. Prior spatial evaluations were smaller and more scattered, so this combination of scale, task variety, and dual improvement routes is what is actually new here. The evaluation section and the reported improvements look like solid empirical work on the surface.

The soft spot is the task selection. The manuscript lists the 30 tasks and their sources but supplies no taxonomy or coverage analysis against standard spatial dimensions such as static versus dynamic, egocentric versus allocentric, or 2D versus 3D manipulation. Without that map it is hard to judge whether important real-world skills are missing or over-weighted, even if each sample was checked by hand. That concern is real but not fatal to the main claims.

The paper is aimed at researchers who work on multimodal models for robotics or AR and need a shared testbed or starting resources. A reader who wants concrete numbers on current model limits and two concrete ways to move forward will find usable material. The work shows clear thinking about the problem and honest reporting of results, so it deserves a serious referee rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SpatialScore as a new benchmark for multimodal spatial intelligence, containing approximately 5K manually verified samples across 30 tasks that span multiple visual data types, input modalities, and QA formats. It reports an evaluation of 49 MLLMs that reveals persistent gaps relative to human performance, constructs SpatialCorpus (331K multimodal QA samples) to support fine-tuning, and proposes SpatialAgent, a multi-agent system with 12 specialized spatial perception tools that enables training-free gains via Plan-Execute and ReAct reasoning. The central claim is that these resources together provide the most comprehensive evaluation suite to date and practical routes toward closing the human-model gap.

Significance. If the 30 tasks and verification process adequately cover essential spatial skills without systematic bias or omission, the released benchmark, corpus, code, and models would constitute a valuable community resource for standardizing evaluation and driving progress in MLLM spatial reasoning. The dual data-driven and agent-based improvement strategies, together with the scale of the model evaluation, add practical utility beyond pure benchmarking.

major comments (1)

[§3] §3 (Benchmark Construction and Task Selection): The claim that SpatialScore is the most comprehensive benchmark rests on the 30 tasks plus manual verification together covering essential spatial intelligence components without bias or major gaps. The manuscript presents the tasks and data sources but supplies no formal taxonomy of spatial abilities (e.g., static vs. dynamic, egocentric vs. allocentric, 2D vs. 3D mental manipulation) nor a coverage analysis showing how the chosen tasks map onto such dimensions. This leaves open the possibility that entire classes of real-world spatial reasoning are absent or underrepresented.

minor comments (2)

[Abstract] Abstract: The phrase 'approximately 5K' should be replaced by the exact sample count for precision and reproducibility.
[§5] §5 (SpatialAgent): A summary table listing the 12 specialized tools, their inputs/outputs, and intended spatial sub-skill would improve clarity and allow readers to assess coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding benchmark construction and task selection below, and we will revise the paper to incorporate a formal taxonomy and coverage analysis as suggested.

Referee: [§3] §3 (Benchmark Construction and Task Selection): The claim that SpatialScore is the most comprehensive benchmark rests on the 30 tasks plus manual verification together covering essential spatial intelligence components without bias or major gaps. The manuscript presents the tasks and data sources but supplies no formal taxonomy of spatial abilities (e.g., static vs. dynamic, egocentric vs. allocentric, 2D vs. 3D mental manipulation) nor a coverage analysis showing how the chosen tasks map onto such dimensions. This leaves open the possibility that entire classes of real-world spatial reasoning are absent or underrepresented.

Authors: We thank the referee for this valuable observation. Our 30 tasks were selected after reviewing prior spatial reasoning benchmarks and cognitive science literature to span diverse visual data types, modalities, and formats, including elements of object localization, spatial relations, mental rotation, navigation, and 3D understanding. Nevertheless, we acknowledge that the original manuscript did not include an explicit formal taxonomy or systematic coverage mapping. In the revised version, we will add a dedicated subsection to §3 that introduces a taxonomy of spatial abilities (categorizing along axes such as static vs. dynamic, egocentric vs. allocentric, and 2D vs. 3D mental manipulation, drawing from established frameworks in the field) and provides a table mapping each of the 30 tasks to these dimensions, along with a brief discussion of coverage and potential gaps. This addition will be grounded in the task descriptions and data sources already presented, thereby strengthening the justification for the comprehensiveness claim without altering the core benchmark content.

revision: yes
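
The coverage analysis discussed in this exchange could be organized roughly as follows. This is only a sketch under stated assumptions: the three axes come from the referee comment, the category names come from the paper passage quoted elsewhere on this page, and the label assigned to each category in `TASK_LABELS` is a hypothetical illustration, not the authors' mapping. The idea is to enumerate every cell of the taxonomy and report which cells no task falls into.

```python
from collections import Counter
from itertools import product

# Axes taken from the referee comment; the value names are illustrative.
AXES = {
    "dynamics": ("static", "dynamic"),
    "frame": ("egocentric", "allocentric"),
    "dimension": ("2D", "3D"),
}

# Hypothetical labels for four categories; NOT the authors' mapping.
TASK_LABELS = {
    "depth estimation": ("static", "egocentric", "3D"),
    "object localization": ("static", "egocentric", "2D"),
    "camera pose & motion": ("dynamic", "egocentric", "3D"),
    "object motion": ("dynamic", "allocentric", "3D"),
}

def coverage(task_labels: dict) -> dict:
    """Count how many tasks fall into each cell of the taxonomy grid."""
    counts = Counter(task_labels.values())
    return {cell: counts.get(cell, 0) for cell in product(*AXES.values())}

def uncovered(task_labels: dict) -> list:
    """Return the taxonomy cells that no task maps to."""
    return [cell for cell, n in coverage(task_labels).items() if n == 0]

gaps = uncovered(TASK_LABELS)
for cell in gaps:
    print("no task covers:", cell)
```

A table built this way makes over-weighted cells visible as well, since `coverage` returns the count per cell rather than a yes/no flag.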

Circularity Check

0 steps flagged

No circularity: empirical benchmark and resource creation paper

The manuscript introduces SpatialScore (5K manually verified samples across 30 tasks), SpatialCorpus (331K QA pairs), and SpatialAgent (multi-agent system with 12 tools) as new artifacts, then reports evaluations of 49 MLLMs. No derivation chain, equations, fitted parameters, or first-principles predictions are claimed; the central assertions rest on the construction process, manual verification, and experimental outcomes rather than any reduction to self-defined inputs or self-citations. The work is self-contained as resource creation and empirical assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that manually verified QA pairs constitute a valid proxy for spatial intelligence and that the chosen 30 tasks are representative; no new physical constants or fitted parameters are introduced in the abstract.

axioms (1)

domain assumption: Human performance on the SpatialScore tasks constitutes the appropriate upper bound for model comparison.
Stated in the abstract when reporting the gap between current models and human-level spatial intelligence.
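
To make the role of this assumption concrete, the sketch below computes a per-category gap between a model and a human reference. The function names are hypothetical and the records are placeholders, not results from the paper. The point is only that every reported gap is measured against whatever human accuracy is supplied.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """records: iterable of (category, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}

def gap_to_human(model_acc, human_acc):
    """Positive values mean the model trails the human reference."""
    return {c: human_acc[c] - model_acc[c] for c in model_acc if c in human_acc}

# Placeholder records for illustration only; not data from the paper.
model_records = [("counting", True), ("counting", False),
                 ("object size", True), ("object size", True)]
human_records = [("counting", True), ("counting", True),
                 ("object size", True), ("object size", True)]

gaps = gap_to_human(per_category_accuracy(model_records),
                    per_category_accuracy(human_records))
```

If the human reference is itself imperfect on some category, the computed gap understates how far the model is from a correct answer there.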

pith-pipeline@v0.9.0 · 5823 in / 1217 out tokens · 33314 ms · 2026-05-22T13:03:28.271946+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: SpatialScore ... spanning 30 distinct tasks ... grouped ... into 10 intuitive categories: mental animation, counting, depth estimation, object distance, object motion, camera pose & motion, temporal reasoning, view reasoning, object size, and object localization.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples ... SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 7.0
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
cs.CV · 2026-04 · unverdicted · novelty 7.0
SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
cs.CV · 2026-04 · unverdicted · novelty 7.0
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
cs.CV · 2026-04 · unverdicted · novelty 6.0
Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG · 2026-02 · conditional · novelty 6.0
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG · 2026-02 · unverdicted · novelty 6.0
MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
cs.CV · 2025-05 · unverdicted · novelty 6.0
Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
cs.AI · 2025-09 · unverdicted · novelty 5.0
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
cs.AI · 2025-10 · unverdicted · novelty 4.0
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...

Reference graph

Works this paper leans on

155 extracted references · 155 canonical work pages · cited by 8 Pith papers · 14 internal anchors

[1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024

Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024

work page 2024
[4]

System card: Claude sonnet 4.5, 2025

Anthropic. System card: Claude sonnet 4.5, 2025

work page 2025
[5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Spatialthinker: Reinforcing 3d reasoning in multimodal llms via spatial rewards.arXiv preprint arXiv:2511.07403, 2025

Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, and Ronald Clark. Spatialthinker: Reinforcing 3d reasoning in multimodal llms via spatial rewards.arXiv preprint arXiv:2511.07403, 2025

work page arXiv 2025
[8]

Omni3d: A large benchmark and model for 3d object detection in the wild

Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[9]

Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

work page arXiv 2025
[10]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InIEEE International Conference on Robotics and Automation, 2025

work page 2025
[11]

Scaling spatial intelligence with multimodal foundation models

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 13

work page arXiv 2025
[12]

Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, et al. Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

work page arXiv 2025
[13]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[14]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InProceedings of the International Conference on Learning Representations, 2024

work page 2024
[15]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InConference on Neural Information Processing Systems, 2025

work page 2025
[17]

Physbench: Benchmarking and enhancing vision-language models for physical world understanding

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world understanding. InProceedings of the International Conference on Learning Representations, 2025

work page 2025
[18]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model, 2024

X.AI Corp. Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model, 2024

work page 2024
[20]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

work page 2017
[21]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. InProceedings of the International Conference on Learning Representations, 2024

work page 2024
[22]

Mm-spatial: Exploring 3d spatial understanding in multimodal llms

Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. InProceedings of the International Conference on Computer Vision, 2025

work page 2025
[23]

Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026

work page 2026
[24]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[25]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InProceedings of the European Conference on Computer Vision, 2024

work page 2024
[26]

Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 2024

Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 2024

work page 2024
[27]

Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence

Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, and Rongrong Ji. Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence. InProceedings of the International Conference on Learning Representations, 2026

work page 2026
[28]

A new era of intelligence with gemini 3, 2025

Google. A new era of intelligence with gemini 3, 2025

work page 2025
[29]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Embodied llm agents learn to cooperate in organized teams.arXiv preprint arXiv:2403.12482, 2024

Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L Griffiths, and Mengdi Wang. Embodied llm agents learn to cooperate in organized teams.arXiv preprint arXiv:2403.12482, 2024

work page arXiv 2024
[31]

Cambridge university press, 2003

Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision, volume 665. Cambridge university press, 2003

work page 2003
[32]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InProceedings of the International Conference on Learning Representations, 2024

work page 2024
[33]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InProceedings of the International Conference on Learning Representations, 2026

work page 2026
[34]

Detect anything via next point prediction

Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026

work page 2026
[35]

What’s “up” with vision-language models? investigating their struggle with spatial reasoning

Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[36]

Mapanything: Universal feed-forward metric 3d reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. InInternational Conference on 3D Vision, 2026

work page 2026
[37]

Cubify anything: Scaling indoor 3d object detection

Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, and Afshin Dehghan. Cubify anything: Scaling indoor 3d object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[38]

LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025

work page 2025
[39]

Seed-bench: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[40]

Camel: Communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. InConference on Neural Information Processing Systems, 2023

work page 2023
[41]

Agent hospital: A simulacrum of hospital with evolvable medical agents

Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024

[42]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? In Proceedings of the International Conference on Computer Vision, 2025

[43]

Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks. In Conference on Neural Information Processing Systems, 2024

[44]

Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models

Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

[45]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

[46]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[47]

Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence

Chonghan Liu, Haoran Wang, Felix Henry, Pu Miao, Yajie Zhang, Yu Zhao, and Peiran Wu. Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence. In Conference on Neural Information Processing Systems, 2025

[48]

Visual spatial reasoning

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023

[49]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

[50]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Conference on Neural Information Processing Systems, 2023

[51]

Versavit: Enhancing mllm vision backbones via task-guided optimization

Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, et al. Versavit: Enhancing mllm vision backbones via task-guided optimization. arXiv preprint arXiv:2602.09934, 2026

[52]

Lamra: Large multimodal model as your advanced retrieval assistant

Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

[53]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, 2024

[54]

Distinctive image features from scale-invariant keypoints

David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004

[55]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proceedings of the International Conference on Learning Representations, 2024

[56]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, and Jieneng Chen. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In Proceedings of the International Conference on Computer Vision, 2025

[57]

Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning

Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. In Conference on Neural Information Processing Systems, 2025

[58]

Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

[59]

Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action

Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Niebles, et al. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action. arXiv preprint arXiv:2412.05479, 2024

[60]

Mmiu: Multimodal multi-image understanding for evaluating large vision-language models

Fanqing Meng, Chuanhao Li, Jin Wang, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, et al. Mmiu: Multimodal multi-image understanding for evaluating large vision-language models. In Proceedings of the International Conference on Learning Representations, 2025

[61]

Scenegen: Single-image 3d scene generation in one feedforward pass

Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass. In International Conference on 3D Vision, 2026

[62]

GPT-5 System Card

OpenAI. GPT-5 System Card, 2025. Accessed: 2025-11-1

[63]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025

[64]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Association for Computational Linguistics, 2024

[65]

Multi-agent system for comprehensive soccer understanding

Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Multi-agent system for comprehensive soccer understanding. In ACM Multimedia, 2025

[66]

Towards universal soccer video understanding

Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

[67]

Matchtime: Towards automatic soccer game commentary generation

Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, and Weidi Xie. Matchtime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

[68]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations, 2025

[69]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

[70]

Satori-r1: Incentivizing multimodal reasoning through explicit visual anchoring

Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori-r1: Incentivizing multimodal reasoning through explicit visual anchoring. arXiv preprint arXiv:2505.19094, 2025

[71]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Conference on Neural Information Processing Systems, 2023

[72]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

[73]

Mind the gap: Benchmarking spatial reasoning in vision-language models

Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

[74]

Multi-agent embodied question answering in interactive environments

Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, and Fuchun Sun. Multi-agent embodied question answering in interactive environments. In Proceedings of the European Conference on Computer Vision, 2020

[75]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[76]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[77]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, 2020

[78]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In Conference on Neural Information Processing Systems, 2024

[79]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

[80]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025


[36] [36]

Mapanything: Universal feed-forward metric 3d reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. InInternational Conference on 3D Vision, 2026

work page 2026

[37] [37]

Cubify anything: Scaling indoor 3d object detection

Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, and Afshin Dehghan. Cubify anything: Scaling indoor 3d object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[38] [38]

LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2025

work page 2025

[39] [39]

Seed-bench: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[40] [40]

Camel: Communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. InConference on Neural Information Processing Systems, 2023

work page 2023

[41] [41]

Agent hospital: A simulacrum of hospital with evolvable medical agents,

Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents.arXiv preprint arXiv:2405.02957, 2024

work page arXiv 2024

[42] [42]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the International Conference on Computer Vision, 2025

Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the International Conference on Computer Vision, 2025

work page 2025

[43] [43]

Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks. InConference on Neural Information Processing Systems, 2024

work page 2024

[44] [44]

Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models

Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024

[45] [45]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[46] [46]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 15

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence

Chonghan Liu, Haoran Wang, Felix Henry, Pu Miao, Yajie Zhang, Yu Zhao, and Peiran Wu. Mirage: A multi-modal benchmark for spatial perception, reasoning, and intelligence. InConference on Neural Information Processing Systems, 2025

work page 2025

[48] [48]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 2023

work page 2023

[49] [49]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[50] [50]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InConference on Neural Information Processing Systems, 2023

work page 2023

[51] [51]

Versavit: Enhancing mllm vision backbones via task-guided optimization.arXiv preprint arXiv:2602.09934, 2026

Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, et al. Versavit: Enhancing mllm vision backbones via task-guided optimization.arXiv preprint arXiv:2602.09934, 2026

work page arXiv 2026

[52] [52]

Lamra: Large multimodal model as your advanced retrieval assistant

Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[53] [53]

Mmbench: Is your multi-modal model an all-around player? InProceedings of the European Conference on Computer Vision, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InProceedings of the European Conference on Computer Vision, 2024

work page 2024

[54] [54]

Distinctive image features from scale-invariant keypoints.International Journal of Computer Vision, 2004

David G Lowe. Distinctive image features from scale-invariant keypoints.International Journal of Computer Vision, 2004

work page 2004

[55] [55]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InProceedings of the International Conference on Learning Representations, 2024

work page 2024

[56] [56]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, and Jieneng Chen. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the International Conference on Computer Vision, 2025

work page 2025

[57] [57]

Spatialrea- soner: Towards explicit and generalizable 3d spatial reasoning

Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialrea- soner: Towards explicit and generalizable 3d spatial reasoning. InConference on Neural Information Processing Systems, 2025

work page 2025

[58] [58]

Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[59] [59]

Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action.arXiv preprint arXiv:2412.05479, 2024

Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Niebles, et al. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action.arXiv preprint arXiv:2412.05479, 2024

work page arXiv 2024

[60] Fanqing Meng, Chuanhao Li, Jin Wang, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, et al. MMIU: Multimodal multi-image understanding for evaluating large vision-language models. In Proceedings of the International Conference on Learning Representations, 2025

[61] Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. SceneGen: Single-image 3D scene generation in one feedforward pass. In International Conference on 3D Vision, 2026

[62] OpenAI. GPT-5 System Card, 2025. Accessed: 2025-11-1

[63] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025

[64] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development. In Association for Computational Linguistics, 2024

[65] Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Multi-agent system for comprehensive soccer understanding. In ACM Multimedia, 2025

[66] Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

[67] Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, and Weidi Xie. MatchTime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

[68] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations, 2025

[69] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

[70] Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori-R1: Incentivizing multimodal reasoning through explicit visual anchoring. arXiv preprint arXiv:2505.19094, 2025

[71] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Conference on Neural Information Processing Systems, 2023

[72] Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

[73] Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

[74] Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, and Fuchun Sun. Multi-agent embodied question answering in interactive environments. In Proceedings of the European Conference on Computer Vision, 2020

[75] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[76] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025

[77] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, 2020

[78] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In Conference on Neural Information Processing Systems, 2024

[79] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

[80] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025