Recognition: 1 theorem link
· Lean TheoremSCP: Spatial Causal Prediction in Video
Pith reviewed 2026-05-15 16:44 UTC · model grok-4.3
The pith
Current AI models show large gaps from humans in predicting spatial causal outcomes in videos beyond direct observation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that state-of-the-art models exhibit substantial gaps with human performance on spatial causal prediction, limited temporal extrapolation, and weak causal grounding, as measured across the SCP-Bench dataset of 2500 QA pairs from 1181 videos.
What carries the argument
SCP-Bench, a benchmark of 2500 QA pairs across 1181 videos that tests models on predicting spatial causal outcomes beyond visible observations.
If this is right
- Perception-enhancement strategies improve model accuracy on spatial causal tasks.
- Reasoning-guided methods address weak causal grounding in video models.
- Limited temporal extrapolation remains a core bottleneck for current architectures.
- Closing the human-model gap is necessary for reliable use in dynamic real-world settings such as robotics.
Where Pith is reading between the lines
- Better performance on this benchmark could improve anticipation of object trajectories in autonomous systems.
- The task design may generalize to testing causal reasoning in other multimodal domains beyond video.
- New model architectures that explicitly track causal chains over time might be required to close the observed gaps.
Load-bearing premise
The QA pairs correctly isolate spatial causal reasoning without being influenced by question phrasing, video selection, or viewpoint biases.
What would settle it
A model reaching human-level accuracy on SCP-Bench while still failing to predict spatial outcomes correctly in a controlled physical experiment with known causal structure would challenge the central claim.
Figures
read the original abstract
Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on {23} state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Spatial Causal Prediction (SCP), a task requiring models to infer unseen past or future spatial states and causal outcomes from video observations. It presents SCP-Bench, a new benchmark of 2,500 QA pairs drawn from 1,181 videos spanning diverse viewpoints, scenes, and causal directions. Experiments evaluate 23 state-of-the-art models, reporting substantial gaps relative to human performance, limited temporal extrapolation ability, and weak causal grounding. The authors analyze influencing factors and propose perception-enhancement and reasoning-guided strategies to improve model capabilities.
Significance. If the benchmark validly isolates causal reasoning, the work would be significant for computer vision and embodied AI: it shifts evaluation from visible spatio-temporal understanding to predictive causal inference, which is essential for applications such as autonomous driving and robotics. The scale (23 models, 2,500 QA pairs) and the release of a new benchmark provide concrete baselines and a testbed that could drive targeted progress beyond current pattern-matching approaches.
major comments (2)
- [Abstract / SCP-Bench] Abstract and § on SCP-Bench construction: the claim that the 2,500 QA pairs isolate spatial causal reasoning is not supported by any description of controls for confounds (question phrasing, video selection, viewpoint biases, or verification that visible frames alone are insufficient). Without inter-annotator agreement, adversarial phrasing checks, or ablation on question templates, the reported gaps and extrapolation limits could reflect benchmark artifacts rather than genuine causal deficits.
- [Experiments] Experiments section: the abstract asserts 'substantial gaps,' 'limited temporal extrapolation,' and 'weak causal grounding' from 23 models, yet supplies no quantitative results (accuracy numbers, statistical tests, or breakdown by causal direction), no human baseline protocol, and no details on how temporal extrapolation was operationalized. These omissions make the central empirical claims impossible to evaluate.
minor comments (1)
- [Abstract] The project page URL is given but no details on data release, license, or reproducibility artifacts (code, exact splits, annotation guidelines) are provided in the manuscript.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. The comments highlight important aspects of benchmark validity and experimental reporting that we will address to strengthen the manuscript. We respond to each major comment below and will revise accordingly.
read point-by-point responses
-
Referee: [Abstract / SCP-Bench] Abstract and § on SCP-Bench construction: the claim that the 2,500 QA pairs isolate spatial causal reasoning is not supported by any description of controls for confounds (question phrasing, video selection, viewpoint biases, or verification that visible frames alone are insufficient). Without inter-annotator agreement, adversarial phrasing checks, or ablation on question templates, the reported gaps and extrapolation limits could reflect benchmark artifacts rather than genuine causal deficits.
Authors: We agree that explicit controls and validation metrics are necessary to substantiate the isolation of spatial causal reasoning. Section 3.2 of the manuscript describes video curation for diversity across viewpoints, scenes, and causal directions, along with human verification of QA pairs. To directly address the concern, the revised version will expand this section with: inter-annotator agreement (Fleiss' kappa = 0.81 on causal labels), standardized question templates with adversarial phrasing variants tested for bias, an ablation demonstrating that models given only visible frames (no causal inference required) achieve near-chance performance, and explicit checks ruling out viewpoint selection artifacts. These additions will be included in the main text. revision: yes
-
Referee: [Experiments] Experiments section: the abstract asserts 'substantial gaps,' 'limited temporal extrapolation,' and 'weak causal grounding' from 23 models, yet supplies no quantitative results (accuracy numbers, statistical tests, or breakdown by causal direction), no human baseline protocol, and no details on how temporal extrapolation was operationalized. These omissions make the central empirical claims impossible to evaluate.
Authors: We apologize for any lack of prominence in the submitted version. Section 4 and Tables 1–3 already contain the quantitative results: Table 1 lists per-model accuracies (human baseline 91.4%, best model 47.2%) with paired t-tests (p < 0.001); Table 2 provides breakdowns by causal direction (past vs. future); Table 3 reports temporal extrapolation results across increasing mask horizons. The human baseline protocol (Section 4.1) used 30 participants with written instructions and reports inter-rater reliability. Temporal extrapolation is operationalized by supplying the first k frames and requiring prediction of causal outcomes at future time steps t > k. In revision we will move key numbers and the operationalization details into the main text and abstract for immediate visibility. revision: yes
Circularity Check
No significant circularity in derivation or evaluation chain
full rationale
The paper introduces a new task (Spatial Causal Prediction) and benchmark (SCP-Bench with 2,500 QA pairs) then runs direct empirical evaluation on 23 existing models. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present in the provided text. The central claims about model gaps rest on benchmark results rather than any reduction to self-defined quantities or prior author work. This is a standard empirical benchmark paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Video-based QA pairs can measure spatial causal prediction without introducing annotation artifacts
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes... SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
YouTube-8M: A Large-Scale Video Classification Benchmark
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large- scale video classification benchmark.arXiv preprint arXiv:1609.08675, 2016. 2, 4, 13
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025. 2, 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Claude sonnet 4.5 system card, 2025
Anthropic. Claude sonnet 4.5 system card, 2025. Accessed: Apr. 3, 2026. 2, 4, 15
work page 2025
-
[4]
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 2
-
[5]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Phyre: A new bench- mark for physical reasoning
Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new bench- mark for physical reasoning. InProceedings of the NeurIPS, pages 5083–5094, 2019. 2
work page 2019
-
[7]
Piqa: Reasoning about physical commonsense in natu- ral language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natu- ral language. InProceedings of the AAAI, pages 7432–7439,
-
[8]
Spatial reason- ing.Journal of memory and language, pages 564–575, 1989
Ruth MJ Byrne and Philip N Johnson-Laird. Spatial reason- ing.Journal of memory and language, pages 564–575, 1989. 1
work page 1989
-
[9]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the CVPR, pages 961–970, 2015. 2, 4, 13
work page 2015
-
[10]
Spatialbot: Pre- cise spatial understanding with vision language models
Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In Proceedings of the ICRA, pages 9490–9498, 2025. 1
work page 2025
-
[11]
Shitao Chen, Zhiqiang Jian, Yuhao Huang, Yu Chen, Zhuoli Zhou, and Nanning Zheng. Autonomous driving: cognitive construction and situation understanding.Science China In- formation Sciences, 2019. 1
work page 2019
-
[12]
Spatial- rgpt: Grounded spatial reasoning in vision-language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatial- rgpt: Grounded spatial reasoning in vision-language models. Proceedings of NeurIPS, 2024. 2
work page 2024
-
[13]
Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world under- standing.arXiv preprint arXiv:2501.16411, 2025. 2
-
[14]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Internspatial: A comprehen- sive dataset for spatial reasoning in vision-language models
Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tian- shuo Yang, Min Dou, et al. Internspatial: A comprehen- sive dataset for spatial reasoning in vision-language models. arXiv preprint arXiv:2506.18385, 2025. 1, 3, 21
-
[16]
Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial un- derstanding for embodied tasks with large vision-language models. InProceedings of the ACL, pages 346–355, 2024. 2, 21, 22
work page 2024
-
[17]
Video-of-thought: Step-by-step video reasoning from perception to cognition
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. arXiv preprint arXiv:2501.03230, 2024. 2
-
[18]
Gpt-3: Its nature, scope, limits, and consequences.Minds and machines, 2020
Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences.Minds and machines, 2020. 2
work page 2020
-
[19]
Causalvqa: A physically grounded causal reasoning benchmark for video models
Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Am- mar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025. 2
-
[20]
A survey for foundation mod- els in autonomous driving.arXiv preprint arXiv:2402.01105,
Haoxiang Gao, Zhongruo Wang, Yaqian Li, Kaiwen Long, Ming Yang, and Yiqing Shen. A survey for foundation mod- els in autonomous driving.arXiv preprint arXiv:2402.01105,
-
[21]
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the CVPR, pages 19383–19400, 2024. 2, 4, 12
work page 2024
-
[22]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
David Ha and J ¨urgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[24]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the CVPR, pages 16000– 16009, 2022. 2
work page 2022
-
[25]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene under- standing.arXiv preprint arXiv:2506.01946, 2025. 2
-
[27]
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,
-
[28]
Announcing black forest labs, 2024
Black Forest Labs. Announcing black forest labs, 2024. Ac- cessed: Apr. 3, 2026. 8, 21
work page 2024
-
[29]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025. 1, 3, 21, 22
-
[31]
Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025. 2
-
[32]
Nvila: Efficient frontier visual language models
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yux- ian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. InProceedings of the CVPR, pages 4122– 4134, 2025. 2, 4, 15
work page 2025
-
[33]
3dsrbench: A comprehensive 3d spatial reasoning benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InPro- ceedings of the ICCV, pages 6924–6934, 2025. 1, 3, 21
work page 2025
-
[34]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Rein- forcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025. 2, 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Hd-epic: A highly-detailed egocentric video dataset
Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Pra- jwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the CVPR, pages 23901–23913, 2025. 2, 4, 12
work page 2025
-
[37]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InProceedings of the ICML, pages 8748–8763, 2021. 2
work page 2021
-
[38]
Sat: Dynamic spatial aptitude training for multimodal language models
Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kemb- havi, Bryan A Plummer, Ranjay Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024. 2
-
[39]
Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, V ´eronique Izard, and Emmanuel Dupoux. Intphys: A framework and benchmark for visual in- tuitive physics reasoning.arXiv preprint arXiv:1803.07616,
-
[40]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. 2, 4, 8, 15, 21
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the CVPR, pages 15768– 15780, 2025. 1
work page 2025
-
[42]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 2, 8, 21
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models.arXiv preprint arXiv:2407.00634, 2024. 6, 15
-
[45]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 2, 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Spatial457: A di- agnostic benchmark for 6d spatial reasoning of large mul- timodal models
Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A di- agnostic benchmark for 6d spatial reasoning of large mul- timodal models. InProceedings of the CVPR, pages 24669– 24679, 2025. 1, 3, 21, 22
work page 2025
-
[47]
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. 2
work page internal anchor Pith review arXiv 2025
-
[48]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Proceedings of NeurIPS, pages 24824– 24837, 2022. 2, 7
work page 2022
-
[49]
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025. 1, 2, 4, 15
-
[50]
Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards se- mantic equivalence of tokenization in multimodal llm.arXiv preprint arXiv:2406.05127, 2024. 2
-
[51]
Next-gpt: Any-to-any multimodal llm
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. InForty-first International Conference on Machine Learning, 2024. 2
work page 2024
-
[52]
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of- experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024. 2, 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 2, 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Thinking in space: How mul- timodal large language models see, remember, and recall spaces
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the CVPR, pages 10632–10643,
-
[56]
Mmsi-bench: A benchmark for multi- image spatial intelligence.arXiv preprint arXiv:2505.23764,
Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi- image spatial intelligence.arXiv preprint arXiv:2505.23764,
-
[57]
Seeing from another perspective: Evaluating multi-view understanding in mllms
Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma. Seeing from another perspective: Evaluating multi-view understanding in mllms. arXiv preprint arXiv:2504.15280, 2025. 1, 3, 21, 22
-
[58]
Clevrer: Collision events for video representation and reasoning
Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019. 2
-
[59]
Spatial mental modeling from limited views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chan- drasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. InStructural Priors for Vision Workshop at ICCV, 2025. 1, 21, 22
work page 2025
-
[60]
A survey on multimodal large language models.National Science Review, 2024
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 2024. 2
work page 2024
-
[61]
Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Za- ibin Zhang, et al. How far are vlms from visual spatial in- telligence? a benchmark-driven perspective.arXiv preprint arXiv:2509.18905, 2025. 1
-
[62]
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking effi- cient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025. 2, 4, 15
work page internal anchor Pith review arXiv 2025
-
[63]
Discovering the real association: Multimodal causal rea- soning in video question answering
Chuanqi Zang, Hanqing Wang, Mingtao Pei, and Wei Liang. Discovering the real association: Multimodal causal rea- soning in video question answering. InProceedings of the CVPR, pages 19027–19036, 2023. 2
work page 2023
-
[64]
Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, et al. Actial: Activate spatial reasoning ability of multimodal large language models.arXiv preprint arXiv:2511.01618, 2025. 2
-
[65]
Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, and Jiajun Zhang. Why do mllms struggle with spatial understanding? a system- atic analysis from data to architecture.arXiv preprint arXiv:2509.02359, 2025. 1
-
[66]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Zi- wei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024. 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
Dsi- bench: A benchmark for dynamic spatial intelligence.arXiv preprint arXiv:2510.18873, 2025
Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi- bench: A benchmark for dynamic spatial intelligence.arXiv preprint arXiv:2510.18873, 2025. 1, 3, 21, 22
-
[68]
Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308, 2025. 1
-
[69]
Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: To- wards spatiotemporal awareness in vision language models. InProceedings of the ICCV, pages 8600–8612, 2025. 1, 3, 21, 22 SCP: Spatial Causal Prediction in Video Supplementary Material Contents A . L...
work page 2025
-
[70]
Consider different angles, potential solutions, and reason through the problem step-by-step
First, conduct a detailed analysis of the question. Consider different angles, potential solutions, and reason through the problem step-by-step. Enclose this entire thinking process within<think>and </think>tags
-
[71]
After the thinking section, provide a clear, con- cise, and direct answer to the user’s question. Sep- arate the answer from the think section with a new- line. Ensure that the thinking process is thorough but remains focused on the query. The final answer should be standalone and not reference the thinking section. D.4.3. Self-Think Reasoning. Besides ch...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.