pith. machine review for the scientific record.

arxiv: 2605.10106 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatial reasoning · multi-modal large language models · MLLMs · video-based agent · training-free · 3D understanding · expert models · plug-and-play

The pith

ViSRA improves spatial reasoning in MLLMs by feeding explicit 3D data from expert models into video inputs without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a modular agent called ViSRA can elicit better 3D spatial reasoning inside existing multi-modal language models by pulling accurate spatial details from separate expert systems and passing them along in video form. This matters because most current gains in spatial intelligence come from expensive post-training on hand-curated datasets, while ViSRA stays training-free and works across different base models. A sympathetic reader would see it as a practical way to add human-like spatial understanding to MLLMs on both familiar benchmarks and entirely new tasks. The core idea is that making spatial information explicit and modular lets the language model use it flexibly rather than overfitting to narrow examples.

Core claim

ViSRA is a human-aligned Video-based Spatial Reasoning Agent that elicits spatial reasoning in MLLMs in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm with no post-training computational cost and no heavy manual curation of spatial reasoning datasets.

What carries the argument

ViSRA, the Video-based Spatial Reasoning Agent, which extracts explicit spatial cues from expert models and supplies them to MLLMs through video inputs to probe and strengthen spatial reasoning.
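To make the paradigm concrete, here is a minimal sketch of the pipeline as described: expert models produce explicit spatial cues, which are serialized and handed to a frozen MLLM at inference time. The wrapper names (SpatialCue, extract_cues, the detector/depth/MLLM interfaces) are editorial assumptions for illustration, not ViSRA's actual API.

```python
# Minimal sketch of the training-free, plug-and-play paradigm: expert models
# supply explicit 3D cues; the base MLLM stays frozen and only sees richer input.
# All interfaces here are hypothetical stand-ins, not ViSRA's implementation.

from dataclasses import dataclass

@dataclass
class SpatialCue:
    frame_idx: int                      # video frame the cue comes from
    label: str                          # object category from a detector expert
    bbox: tuple                         # (x1, y1, x2, y2) in pixels
    depth_m: float                      # mean metric depth in the box, from a depth expert

def extract_cues(frames, detector, depth_model):
    """Run expert models over sampled frames and collect explicit spatial cues."""
    cues = []
    for i, frame in enumerate(frames):
        depth = depth_model.predict(frame)          # H x W metric depth map
        for det in detector.detect(frame):          # 2D boxes with labels
            region = depth[det.y1:det.y2, det.x1:det.x2]
            cues.append(SpatialCue(i, det.label,
                                   (det.x1, det.y1, det.x2, det.y2),
                                   float(region.mean())))
    return cues

def answer(question, frames, cues, mllm):
    """Serialize the cues into text and pass them alongside the video input."""
    cue_text = "\n".join(
        f"frame {c.frame_idx}: {c.label} at {c.bbox}, ~{c.depth_m:.1f} m away"
        for c in cues
    )
    prompt = (f"Explicit spatial cues from expert models:\n{cue_text}\n\n"
              f"Question: {question}")
    return mllm.chat(video=frames, text=prompt)     # no fine-tuning anywhere
```

The design choice the paper emphasizes is visible here: swapping `detector` or `depth_model` for a better expert changes nothing else in the pipeline.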

If this is right

  • Consistent absolute gains of up to 15.6 percent on existing spatial reasoning benchmarks across multiple MLLMs.
  • Larger absolute gains of up to 28.9 percent on previously unseen 3D spatial reasoning tasks.
  • No requirement for post-training or creation of new spatial datasets.
  • A plug-and-play structure that works with different underlying MLLMs without modification.
  • Human-aligned 3D understanding that transfers rather than overfits to specific training distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular separation of expert perception from the language core could be tried for other reasoning types such as temporal or causal inference.
  • Making spatial information explicit may make it easier to debug or audit where MLLMs succeed or fail on 3D tasks.
  • Because the method needs no fine-tuning, it could support rapid testing of new expert models as they become available.
  • The training-free design opens the possibility of applying ViSRA in real-time or resource-constrained settings where retraining is impractical.

Load-bearing premise

External expert models can reliably supply accurate and transferable 3D spatial information that MLLMs can integrate effectively at inference time without fine-tuning or task-specific adaptation.

What would settle it

Running ViSRA on standard spatial benchmarks and finding no measurable gain over the plain MLLM baselines, or discovering that the expert models frequently output inaccurate or non-transferable spatial data for the tested scenes.
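Operationally, the settling test is a paired comparison: run the same benchmark split through the plain MLLM and through the agent-wrapped MLLM, and check whether the gain is measurable. A minimal harness sketch follows; `run_baseline` and `run_with_agent` are hypothetical callables standing in for the two conditions.

```python
# Sketch of the settling experiment: paired evaluation of a plain MLLM
# baseline against the same MLLM given expert spatial cues.

def paired_eval(benchmark, run_baseline, run_with_agent):
    """benchmark: sequence of (question, gold_answer) pairs."""
    base_correct = agent_correct = 0
    for question, gold in benchmark:
        base_correct += int(run_baseline(question) == gold)
        agent_correct += int(run_with_agent(question) == gold)
    n = len(benchmark)
    return {
        "baseline_acc": base_correct / n,
        "agent_acc": agent_correct / n,
        # A gain near zero across benchmarks would undercut the central claim.
        "absolute_gain": (agent_correct - base_correct) / n,
    }
```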

Figures

Figures reproduced from arXiv: 2605.10106 by Ce Liu, Hao Yang, Jiabo He, Jingjing Chen, Renying Wang, Tiehua Zhang, Tingshu Mou, Xingjun Ma.

Figure 1. Comparison of three paradigms for 3D spatial reasoning. We evaluate ViSRA against the …
Figure 2. Performance comparison across problem types on VSI-Bench (a subset of 779 questions). Qwen2.5-VL-7B yields a drop on six question types given ground-truth (GT) cognitive maps.
Figure 3. A comparison example. Qwen3-VL-8B succeeds in answering a spatial question with the source video but outputs the wrong answer with the summarized cognitive map.
Figure 4. Overview of ViSRA. The left panel summarizes spatial tools that produce accurate intermediate …
Figure 5. An inference example. ViSRA answers a relative-distance question by using four roles and …
Figure 4. The system decomposes planning, control, execution, and answer synthesis into four specialized roles: the Planner, the Reflector, the Executor, and the Summarizer. Given a question, the Planner parses the query together with the tool schemas and produces a structured execution plan that specifies the required evidence and tool sequence. ViSRA then performs a bounded iterative procedure alternating between … (a minimal sketch of this loop follows the figure list).
Figure 6. An example of ViSRA solving an object-counting question correctly.
Figure 7. An example of ViSRA solving an appearance-order question correctly.
Figure 8. An example of ViSRA solving a relative-direction question correctly.
Figure 9. An example of ViSRA solving an object-counting question incorrectly due to misdetecting …
Figure 10. Visualized and textual examples of ground-truth cognitive maps generated from 3D …
Figure 11. Two examples in VSI-Bench where ambiguous referring expressions lead to ambiguous …
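The role decomposition in Figure 4 (Planner, Reflector, Executor, Summarizer, iterated under a bound) reads as a standard tool-use agent loop. The sketch below is an editorial reconstruction under that reading; the role signatures and the `max_steps` bound are assumptions, not the paper's implementation.

```python
# Editorial sketch of the four-role loop described in Figure 4. The four
# callables are hypothetical; the paper specifies only the roles and that
# the iterative procedure is bounded.

def visra_loop(question, tool_schemas, planner, reflector, executor, summarizer,
               max_steps=5):
    # Planner: parse the question plus tool schemas into a structured plan
    # naming the required evidence and the tool sequence.
    plan = planner(question, tool_schemas)
    evidence = []
    for _ in range(max_steps):                  # bounded iteration
        step = plan.next_step()
        if step is None:                        # plan exhausted
            break
        evidence.append(executor(step))         # Executor: run one tool call
        # Reflector: check the evidence against the plan and revise the
        # remaining steps if a call failed or came back ambiguous.
        plan = reflector(plan, evidence)
        if plan.is_satisfied():
            break
    # Summarizer: synthesize the final answer from the accumulated evidence.
    return summarizer(question, evidence)
```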
original abstract

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a plug-and-play flexible paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost along with heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by up to a 15.6% and 28.9% absolute margin respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ViSRA, a training-free Video-based Spatial Reasoning Agent that augments multi-modal large language models (MLLMs) with explicit 3D spatial information extracted from external expert models. The framework is presented as modular and extensible, enabling plug-and-play integration at inference time to elicit spatial reasoning without fine-tuning or curated dataset post-training. Experimental claims include consistent gains across MLLMs on existing benchmarks (up to 15.6% absolute) and unseen 3D spatial tasks (up to 28.9% absolute).

Significance. If substantiated with rigorous controls, the training-free modular design would be a meaningful contribution to inference-time methods for 3D spatial intelligence in MLLMs, avoiding the costs of fine-tuning while emphasizing transferability. The explicit separation of expert spatial extraction from the MLLM itself is a clear strength that could generalize to other reasoning domains.

major comments (2)
  1. [Framework and Experimental Results] The central claim that expert models supply accurate, human-aligned, and transferable 3D spatial information that MLLMs integrate without adaptation is load-bearing but unsupported by robustness analysis. No ablation or error-propagation study examines how depth-estimation inaccuracies, occlusion handling, or 2D-to-3D projection failures in the experts affect downstream MLLM outputs.
  2. [Experimental Results] Performance gains are reported without sufficient experimental detail on baseline definitions, expert-model selection criteria, prompt-construction specifics, or statistical significance testing. This prevents assessment of whether improvements reflect elicited reasoning or simply richer prompt augmentation.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from explicit enumeration of the unseen 3D tasks and how they were constructed to differ from training distributions.
  2. [Method] Notation for spatial representations (e.g., how 3D coordinates or depth maps are serialized into text prompts) should be formalized for reproducibility.
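On minor point 2, one plausible formalization (an editorial assumption, not the paper's scheme) fixes the field order and numeric precision when serializing per-object 3D coordinates into the prompt, which is exactly what makes prompt construction reproducible:

```python
# Hypothetical serialization of expert 3D outputs into text. The rounding
# precision and field order are illustrative choices; pinning them down is
# what the reproducibility request amounts to.

def serialize_objects(objects, precision=2):
    """objects: list of dicts like {"label": "sofa", "center": (x, y, z)} in meters."""
    lines = []
    for i, obj in enumerate(objects):
        x, y, z = (round(v, precision) for v in obj["center"])
        lines.append(f"object {i}: {obj['label']} at (x={x}, y={y}, z={z}) m")
    return "\n".join(lines)

# serialize_objects([{"label": "sofa", "center": (1.234, 0.0, 3.456)}])
# -> "object 0: sofa at (x=1.23, y=0.0, z=3.46) m"
```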

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, agreeing where additional rigor is warranted and outlining the specific revisions we will implement.

point-by-point responses
  1. Referee: [Framework and Experimental Results] The central claim that expert models supply accurate, human-aligned, and transferable 3D spatial information that MLLMs integrate without adaptation is load-bearing but unsupported by robustness analysis. No ablation or error-propagation study examines how depth-estimation inaccuracies, occlusion handling, or 2D-to-3D projection failures in the experts affect downstream MLLM outputs.

    Authors: We agree that direct robustness analysis is necessary to fully substantiate the load-bearing claim about the accuracy, human alignment, and transferability of the 3D spatial cues extracted by expert models. While the consistent gains on unseen tasks provide indirect support for transferability without task-specific overfitting, we acknowledge the lack of explicit error-propagation studies in the current version. In the revised manuscript, we will add a dedicated ablation subsection that quantifies the effects of depth-estimation inaccuracies, occlusion-handling failures, and 2D-to-3D projection errors on final MLLM outputs, including controlled experiments that inject synthetic errors into the expert outputs and measure downstream performance degradation (a minimal sketch of such an injection follows these responses). revision: yes

  2. Referee: [Experimental Results] Performance gains are reported without sufficient experimental detail on baseline definitions, expert-model selection criteria, prompt-construction specifics, or statistical significance testing. This prevents assessment of whether improvements reflect elicited reasoning or simply richer prompt augmentation.

    Authors: We recognize that greater experimental transparency is required to allow readers to distinguish between genuine spatial reasoning elicitation and the effects of richer prompting. The original manuscript describes the overall setup and reports absolute gains, but we agree that more granular details are needed. In the revision, we will expand the experimental section to include: precise definitions of all baselines, explicit criteria for selecting the expert models (e.g., preference for models with demonstrated human perceptual alignment on standard depth and pose benchmarks), the full set of prompt templates used for MLLM integration, and statistical significance testing with standard deviations and p-values computed over multiple independent runs. These additions will strengthen the evidence that the observed improvements derive from the modular spatial cue integration rather than generic prompt enrichment. revision: yes
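The injection experiment promised in response 1 could take a shape like the sketch below: perturb an expert's depth output with controlled Gaussian noise and record how answer accuracy degrades. The noise model, pipeline interface, and sample format are editorial assumptions.

```python
import numpy as np

# Sketch of an error-propagation ablation: inject synthetic noise into an
# expert's depth output and measure downstream answer accuracy.

def inject_depth_noise(depth_map, sigma_m, rng):
    """Add zero-mean Gaussian noise (std sigma_m, in meters) to a depth map."""
    return depth_map + rng.normal(0.0, sigma_m, size=depth_map.shape)

def degradation_curve(dataset, pipeline, sigmas=(0.0, 0.1, 0.25, 0.5, 1.0), seed=0):
    """dataset: sequence of (frames, depth_map, question, gold_answer) tuples;
    pipeline: callable (frames, depth_map, question) -> predicted answer."""
    rng = np.random.default_rng(seed)
    curve = {}
    for sigma in sigmas:
        correct = 0
        for frames, depth, question, gold in dataset:
            pred = pipeline(frames, inject_depth_noise(depth, sigma, rng), question)
            correct += int(pred == gold)
        curve[sigma] = correct / len(dataset)
    return curve    # accuracy as a function of injected depth error
```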

Circularity Check

0 steps flagged

No circularity: framework uses external experts without derivations or self-referential predictions

full rationale

The paper proposes ViSRA as a modular, training-free agent that plugs explicit 3D spatial outputs from separate expert models into MLLM prompts. No equations, parameter fitting, or derivation chain exist in the described approach; performance gains are reported from direct experiments on benchmarks and unseen tasks. The central premise rests on the (external) reliability of those expert models rather than any reduction of outputs to the paper's own inputs or prior self-citations. This is a standard empirical framework paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert models deliver reliable spatial cues and that MLLMs can utilize them modularly without adaptation. No free parameters or invented physical entities are introduced; the main axiom is domain-level reliance on pre-existing expert tools.

axioms (1)
  • domain assumption: Expert models provide accurate and human-aligned 3D spatial information transferable across tasks.
    Invoked to justify the plug-and-play paradigm and performance gains without training.

pith-pipeline@v0.9.0 · 5503 in / 1260 out tokens · 72667 ms · 2026-05-12T03:55:15.929959+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 13 internal anchors

  1. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  2. Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. arXiv preprint arXiv:2111.08897, 2021.
  3. Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, and Ronald Clark. SpatialThinker: Reinforcing 3D reasoning in multimodal LLMs via spatial rewards. arXiv preprint arXiv:2511.07403, 2025.
  4. Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3D: A large benchmark and model for 3D object detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13154–13164, 2023.
  5. Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511.13719, 2025.
  6. Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.
  7. Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, and Peidong Liu. Reasoning in space via grounding in the world. arXiv preprint arXiv:2510.13800, 2025.
  8. Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  9. Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  10. Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. MM-Spatial: Exploring 3D spatial understanding in multimodal LLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7395–7408, 2025.
  11. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
  12. Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. VideoAgent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92. Springer, 2024.
  13. Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. arXiv preprint arXiv:2505.20279, 2025.
  14. Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. arXiv preprint arXiv:2501.03230, 2024.
  15. Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.
  16. Qi Feng. Towards visuospatial cognition via hierarchical fusion of visual experts. arXiv preprint arXiv:2505.12363, 2025.
  17. Martin A Fischler and Robert C Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  18. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  19. Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. arXiv preprint arXiv:2510.12798, 2025.
  20. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  21. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
  22. Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. ViewSpatial-Bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025.
  23. Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025.
  24. Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765, 2025.
  25. Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921, 2025.
  26. Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.
  27. Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. MMSI-Video-Bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025.
  28. Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. OST-Bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding. arXiv preprint arXiv:2507.07984, 2025.
  29. Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3783–3792, 2025.
  30. Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025.
  31. Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.
  32. Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025.
  33. Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  34. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  35. Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  36. Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
  37. Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
  38. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
  39. Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025.
  40. Peiran Wu, Yunze Liu, Miao Liu, and Junxiao Shen. ST-Think: How multimodal large language models reason about 4D worlds from ego-centric videos. arXiv preprint arXiv:2503.12542, 2025.
  41. Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
  42. Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
  43. Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning. arXiv preprint arXiv:2511.05491, 2025.
  44. Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-S: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025.
  45. Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.
  46. Jirong Zha, Yuxuan Fan, Xiao Yang, Chen Gao, and Xinlei Chen. How to enable LLM with 3D capacity? A survey of spatial reasoning in LLM. arXiv preprint arXiv:2504.05786, 2025.
  47. Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  48. Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3D world: Enhancing MLLMs with 3D vision geometry priors. arXiv preprint arXiv:2505.24625, 2025.
  49. Room size / area estimation prompt = ("You are an expert at estimating room size (area) from videos.\n" "Use the visual information in the video to answer the user's question.\n" "Return ONLY a single best numerical estimate (integer or decimal) in square meters.\n" "Output format (STRICT): <answer>NUMBER</answer>\n" "- Do NOT output units.\n" "- Do NOT ...
  50. Distance estimation (between two objects) prompt = ("You are an expert at estimating REAL-WORLD distance between two objects from videos.\n" "Use the visual information in the video to answer the user's question.\n" "Return ONLY a single best numerical estimate (integer or decimal) in meters.\n" "Output format (STRICT): <answer>NUMBER</answer>\n" "- Do N...
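The prompt templates in entries 49 and 50 both demand the strict output format <answer>NUMBER</answer>. A minimal extractor for that format would look like the sketch below; the paper does not show its actual parser, so the regex is an editorial assumption.

```python
import re

# Sketch of a parser for the strict format "<answer>NUMBER</answer>" demanded
# by the prompt templates in entries 49 and 50.

_ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>")

def parse_numeric_answer(text):
    """Return the number inside <answer>...</answer>, or None if absent."""
    match = _ANSWER_RE.search(text)
    return float(match.group(1)) if match else None

assert parse_numeric_answer("<answer>12.5</answer>") == 12.5
assert parse_numeric_answer("no tags here") is None
```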