ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-12 03:55 UTC · model grok-4.3
The pith
ViSRA improves spatial reasoning in MLLMs by supplying explicit 3D information from expert models alongside video inputs, with no training or fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViSRA is a human-aligned Video-based Spatial Reasoning Agent that elicits spatial reasoning in MLLMs in a modular, extensible manner by leveraging explicit spatial information from expert models. The result is a plug-and-play paradigm that requires neither post-training compute nor heavy manual curation of spatial reasoning datasets.
What carries the argument
ViSRA, the Video-based Spatial Reasoning Agent, which extracts explicit spatial cues from expert models and supplies them to MLLMs through video inputs to probe and strengthen spatial reasoning.
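The plug-and-play flow described above can be sketched in a few lines: an expert model emits explicit 3D cues, which are serialized into plain text that any MLLM prompt can carry unchanged. Every name below (SpatialCue, expert_depth_stub, build_prompt) is a hypothetical stand-in, not ViSRA's actual interface.

```python
# Minimal sketch of a plug-and-play spatial-cue pipeline (hypothetical
# interfaces; the paper's agent roles and expert models differ).
from dataclasses import dataclass


@dataclass
class SpatialCue:
    obj: str                          # object label
    xyz: tuple                        # estimated 3D position in meters


def expert_depth_stub(frames: list) -> list:
    """Stand-in for an expert model (e.g., depth estimation or 3D detection)."""
    return [SpatialCue("chair", (1.2, 0.0, 3.4)),
            SpatialCue("table", (0.5, 0.0, 2.1))]


def build_prompt(question: str, cues: list) -> str:
    """Serialize cues as plain text so the underlying MLLM needs no modification."""
    lines = [f"- {c.obj} at (x={c.xyz[0]}, y={c.xyz[1]}, z={c.xyz[2]}) m"
             for c in cues]
    return "Spatial cues:\n" + "\n".join(lines) + f"\nQuestion: {question}"


prompt = build_prompt("Which object is closer to the camera?",
                      expert_depth_stub(frames=[]))
print(prompt)
```

Because the cues travel as text, swapping in a different expert model or a different MLLM changes nothing on the other side of the interface, which is the sense in which the paradigm is plug-and-play.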
If this is right
- Consistent absolute gains of up to 15.6 percent on existing spatial reasoning benchmarks across multiple MLLMs.
- Larger absolute gains of up to 28.9 percent on previously unseen 3D spatial reasoning tasks.
- No requirement for post-training or creation of new spatial datasets.
- A plug-and-play structure that works with different underlying MLLMs without modification.
- Human-aligned 3D understanding that transfers rather than overfits to specific training distributions.
Where Pith is reading between the lines
- The same modular separation of expert perception from the language core could be tried for other reasoning types such as temporal or causal inference.
- Making spatial information explicit may make it easier to debug or audit where MLLMs succeed or fail on 3D tasks.
- Because the method needs no fine-tuning, it could support rapid testing of new expert models as they become available.
- The training-free design opens the possibility of applying ViSRA in real-time or resource-constrained settings where retraining is impractical.
Load-bearing premise
External expert models can reliably supply accurate and transferable 3D spatial information that MLLMs can integrate effectively at inference time without fine-tuning or task-specific adaptation.
What would settle it
Running ViSRA on standard spatial benchmarks and finding no measurable gain over the plain MLLM baselines, or discovering that the expert models frequently output inaccurate or non-transferable spatial data for the tested scenes.
Original abstract
Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a plug-and-play flexible paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost along with heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by up to a 15.6% and 28.9% absolute margin respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ViSRA, a training-free Video-based Spatial Reasoning Agent that augments multi-modal large language models (MLLMs) with explicit 3D spatial information extracted from external expert models. The framework is presented as modular and extensible, enabling plug-and-play integration at inference time to elicit spatial reasoning without fine-tuning or curated dataset post-training. Experimental claims include consistent gains across MLLMs on existing benchmarks (up to 15.6% absolute) and unseen 3D spatial tasks (up to 28.9% absolute).
Significance. If substantiated with rigorous controls, the training-free modular design would be a meaningful contribution to inference-time methods for 3D spatial intelligence in MLLMs, avoiding the costs of fine-tuning while emphasizing transferability. The explicit separation of expert spatial extraction from the MLLM itself is a clear strength that could generalize to other reasoning domains.
Major comments (2)
- [Framework and Experimental Results] The central claim that expert models supply accurate, human-aligned, and transferable 3D spatial information that MLLMs integrate without adaptation is load-bearing but unsupported by robustness analysis. No ablation or error-propagation study examines how depth-estimation inaccuracies, occlusion handling, or 2D-to-3D projection failures in the experts affect downstream MLLM outputs.
- [Experimental Results] Performance gains are reported without sufficient experimental detail on baseline definitions, expert-model selection criteria, prompt-construction specifics, or statistical significance testing. This prevents assessment of whether improvements reflect elicited reasoning or simply richer prompt augmentation.
Minor comments (2)
- [Abstract and Introduction] The abstract and introduction would benefit from explicit enumeration of the unseen 3D tasks and how they were constructed to differ from training distributions.
- [Method] Notation for spatial representations (e.g., how 3D coordinates or depth maps are serialized into text prompts) should be formalized for reproducibility.
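One candidate formalization for the requested notation: downsample a per-pixel depth map to a coarse grid and render the cell means as text rows. The grid scheme below is a hypothetical sketch for illustration, not the paper's actual serialization.

```python
# Hypothetical depth-map-to-text serialization: coarse grid of mean depths.
import numpy as np


def serialize_depth(depth: np.ndarray, grid=(3, 4)) -> str:
    """Render a depth map as text rows of block-mean depths in meters."""
    h, w = depth.shape
    gh, gw = grid
    rows = []
    for i in range(gh):
        cells = []
        for j in range(gw):
            block = depth[i * h // gh:(i + 1) * h // gh,
                          j * w // gw:(j + 1) * w // gw]
            cells.append(f"{block.mean():.1f}")
        rows.append(" ".join(cells))
    return "depth[m], {}x{} grid:\n".format(gh, gw) + "\n".join(rows)


# Toy depth map whose depth grows left to right, from 1.0 m to about 2.0 m.
depth = np.fromfunction(lambda y, x: 1.0 + x / 64.0, (48, 64))
print(serialize_depth(depth))
```

A fixed, documented scheme like this (grid size, units, rounding) is what would make the prompt construction reproducible.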
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, agreeing where additional rigor is warranted and outlining the specific revisions we will implement.
Point-by-point responses
Referee: [Framework and Experimental Results] The central claim that expert models supply accurate, human-aligned, and transferable 3D spatial information that MLLMs integrate without adaptation is load-bearing but unsupported by robustness analysis. No ablation or error-propagation study examines how depth-estimation inaccuracies, occlusion handling, or 2D-to-3D projection failures in the experts affect downstream MLLM outputs.
Authors: We agree that direct robustness analysis is necessary to fully substantiate the load-bearing claim about the accuracy, human alignment, and transferability of the 3D spatial cues extracted by expert models. While the consistent gains on unseen tasks provide indirect support for transferability without task-specific overfitting, we acknowledge the lack of explicit error-propagation studies in the current version. In the revised manuscript, we will add a dedicated ablation subsection that quantifies the effects of depth-estimation inaccuracies, occlusion handling failures, and 2D-to-3D projection errors on final MLLM outputs, including controlled experiments that inject synthetic errors into the expert outputs and measure downstream performance degradation. Revision: yes.
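The proposed error-injection ablation can be sketched as a controlled noise loop: perturb the expert's depth estimates and measure how often a downstream answer flips. The object list, toy task, and Gaussian noise model below are hypothetical illustrations, not the paper's protocol.

```python
# Hypothetical error-propagation ablation: inject Gaussian noise into
# expert depth estimates and count downstream answer flips.
import random


def inject_depth_noise(cues, sigma, rng):
    """Perturb (object, depth) cues with Gaussian noise, clamped at 0 m."""
    return [(obj, max(0.0, z + rng.gauss(0.0, sigma))) for obj, z in cues]


def closer_object(cues):
    """Toy downstream task: which object do the cues say is closer?"""
    return min(cues, key=lambda c: c[1])[0]


rng = random.Random(0)
clean = [("chair", 1.2), ("table", 2.4)]  # chair is the true answer
for sigma in (0.1, 0.5, 1.0):
    flips = sum(closer_object(inject_depth_noise(clean, sigma, rng)) != "chair"
                for _ in range(1000))
    print(f"sigma={sigma}: answer flipped in {flips / 10:.1f}% of trials")
```

Plotting flip rate against noise level would give exactly the degradation curve the referee asks for, with real expert errors substituted for the synthetic ones.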
Referee: [Experimental Results] Performance gains are reported without sufficient experimental detail on baseline definitions, expert-model selection criteria, prompt-construction specifics, or statistical significance testing. This prevents assessment of whether improvements reflect elicited reasoning or simply richer prompt augmentation.
Authors: We recognize that greater experimental transparency is required to allow readers to distinguish between genuine spatial reasoning elicitation and the effects of richer prompting. The original manuscript describes the overall setup and reports absolute gains, but we agree that more granular details are needed. In the revision, we will expand the experimental section to include: precise definitions of all baselines, explicit criteria for selecting the expert models (e.g., preference for models with demonstrated human perceptual alignment on standard depth and pose benchmarks), the full set of prompt templates used for MLLM integration, and statistical significance testing with standard deviations and p-values computed over multiple independent runs. These additions will strengthen the evidence that the observed improvements derive from the modular spatial cue integration rather than generic prompt enrichment. Revision: yes.
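The promised significance testing could take the shape of per-run accuracies summarized by mean, standard deviation, and a paired t statistic. The accuracy numbers below are illustrative placeholders, not results from the paper.

```python
# Hypothetical significance check over per-run accuracies (illustrative data).
from statistics import mean, stdev
from math import sqrt


def paired_t(a, b):
    """Paired t statistic; a full revision would also report the p-value
    from the t distribution with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


baseline = [61.0, 62.5, 60.8, 61.7, 62.1]  # illustrative accuracies, 5 runs
visra = [75.9, 77.2, 76.4, 76.8, 77.5]     # illustrative accuracies, 5 runs
print(f"baseline {mean(baseline):.1f}±{stdev(baseline):.1f}, "
      f"ViSRA {mean(visra):.1f}±{stdev(visra):.1f}, "
      f"t={paired_t(visra, baseline):.1f}")
```

Pairing runs (same seeds and data splits for both systems) is what justifies the paired test and tightens the comparison relative to independent-sample statistics.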
Circularity Check
No circularity: framework uses external experts without derivations or self-referential predictions
Full rationale
The paper proposes ViSRA as a modular, training-free agent that plugs explicit 3D spatial outputs from separate expert models into MLLM prompts. No equations, parameter fitting, or derivation chain exist in the described approach; performance gains are reported from direct experiments on benchmarks and unseen tasks. The central premise rests on the (external) reliability of those expert models rather than any reduction of outputs to the paper's own inputs or prior self-citations. This is a standard empirical framework paper with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: expert models provide accurate, human-aligned 3D spatial information that transfers across tasks.
Lean theorems connected to this paper
-
Files: IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Foundation/AlexanderDuality.lean, IndisputableMonolith/Cost/FunctionalEquation.lean
Theorems: reality_from_one_distinction, alexander_duality_circle_linking, washburn_uniqueness_aczel
Tag: unclear. The relation between the paper passage and the cited Recognition theorems is ambiguous.
Paper passage: "ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models... four-role agent framework (Plan, Reflect, Execute, Summarize)"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.