OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3
The pith
OpenSpatial shows that 3D bounding boxes can serve as the primitive for an open data engine that generates 3 million diverse spatial samples and lifts model performance by a relative 19 percent on spatial reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenSpatial is an open-source data engine that adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement, Spatial Relationship, Camera Perception, Multi-view Consistency, and Scene-Aware Reasoning. The hierarchy yields the OpenSpatial-3M dataset of three million high-fidelity samples, and versatile models trained on it achieve state-of-the-art performance across spatial reasoning benchmarks, with the best model showing a 19 percent relative average improvement.
What carries the argument
3D bounding boxes as the fundamental primitive that constructs a scalable data hierarchy for the five spatial tasks and drives automatic sample generation.
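To make the primitive-driven idea concrete, here is a minimal sketch of how two of the five tasks could be read directly off box geometry. The paper does not expose its engine's API, so everything below is hypothetical: the Box3D type, both generator functions, the y-up size convention, and the fixed-viewer assumption for left/right relations.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Box3D:
    """Hypothetical 3D bounding-box primitive: center and size in meters (world frame)."""
    label: str
    center: np.ndarray  # shape (3,): x, y, z
    size: np.ndarray    # shape (3,): width, height, depth (assumes y-up)

def spatial_measurement_sample(box: Box3D) -> dict:
    """Spatial Measurement (SM): a QA pair read directly off the box dimensions."""
    height = float(box.size[1])
    return {
        "question": f"How tall is the {box.label}?",
        "answer": f"approximately {height:.2f} m",
    }

def spatial_relationship_sample(a: Box3D, b: Box3D) -> dict:
    """Spatial Relationship (SR): left/right and distance derived from box centers.

    Assumes a fixed viewer at the origin looking down +z with +x to the right;
    a real engine would use the camera pose of each rendered view instead.
    """
    relation = "to the left of" if a.center[0] < b.center[0] else "to the right of"
    distance = float(np.linalg.norm(a.center - b.center))
    return {
        "question": f"Where is the {a.label} relative to the {b.label}?",
        "answer": f"The {a.label} is {relation} the {b.label}, about {distance:.2f} m away.",
    }

chair = Box3D("chair", np.array([-0.8, 0.0, 2.5]), np.array([0.5, 0.9, 0.5]))
table = Box3D("table", np.array([0.6, 0.0, 2.4]), np.array([1.2, 0.75, 0.8]))
print(spatial_measurement_sample(chair))
print(spatial_relationship_sample(chair, table))
```

The appeal of the primitive is visible even in this toy version: every sample is generated, not annotated, so scale comes for free once the boxes exist.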
If this is right
- Versatile models trained on the dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks.
- The best-performing model exhibits a 19 percent relative average improvement.
- Data attributes can be analyzed systematically to reveal their influence on spatial perception.
- Open-sourcing the engine and the 3M-scale dataset supplies a reusable foundation for future spatial intelligence research.
- The infrastructure supports high quality, extensive scalability, broad task diversity, and optimized efficiency in data production.
Where Pith is reading between the lines
- If bounding-box primitives suffice, geometric abstractions may substitute for expensive real-world annotations in other 3D understanding tasks.
- The same hierarchy could be reused to generate training data for downstream applications such as robotic navigation or augmented-reality scene editing.
- Open availability of the generator lowers the cost for independent labs to create custom spatial datasets matched to their own benchmarks.
- Success here would encourage similar primitive-driven engines for related domains like temporal reasoning or physical simulation.
Load-bearing premise
The automatically generated samples using 3D bounding boxes as primitives are sufficiently high-fidelity and diverse to produce genuine gains in spatial understanding rather than overfitting to the synthetic distribution.
What would settle it
A controlled test in which models trained on OpenSpatial-3M are evaluated on real-world spatial benchmarks with human-annotated ground truth and show no improvement or worse performance than identical models trained without the dataset would falsify the central claim.
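That test is, in effect, paired arithmetic over benchmark scores: train otherwise-identical models with and without the dataset, score both on real-world benchmarks, and check whether the relative gain survives a stability check. A minimal sketch, with placeholder accuracies standing in for actual runs; the bootstrap here resamples benchmarks only, whereas a real protocol would also resample training seeds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder per-benchmark accuracies for the same architecture trained
# with and without OpenSpatial-3M; real values would come from the runs.
baseline = np.array([41.2, 35.7, 52.3, 48.9])  # trained without the dataset
treated = np.array([50.1, 43.0, 60.8, 56.6])   # trained with the dataset

# Relative average improvement, the quantity behind the "19 percent" claim:
# mean over benchmarks of (treated - baseline) / baseline.
rel_gain = np.mean((treated - baseline) / baseline)
print(f"relative average improvement: {rel_gain:.1%}")

# Bootstrap over benchmarks to gauge how stable the gain is.
boots = []
for _ in range(10_000):
    idx = rng.integers(0, len(baseline), size=len(baseline))
    boots.append(np.mean((treated[idx] - baseline[idx]) / baseline[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"95% bootstrap CI: [{lo:.1%}, {hi:.1%}]")

# Falsification criterion from the text: the central claim fails if models
# trained with the dataset show no improvement (interval includes zero) or do worse.
print("claim survives:", lo > 0)
```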
Original abstract
Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial -- an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average improvement of 19 percent, relatively. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenSpatial, an open-source data engine that uses 3D bounding boxes as primitives to generate a hierarchical dataset for five spatial tasks (Spatial Measurement, Spatial Relationship, Camera Perception, Multi-view Consistency, and Scene-Aware Reasoning). It releases the OpenSpatial-3M dataset of 3 million samples and claims that versatile models trained on it achieve state-of-the-art performance across spatial reasoning benchmarks, with the best model showing a 19% relative average improvement. The authors also analyze how data attributes influence spatial perception and open-source both the engine and dataset.
Significance. If the performance claims hold under detailed scrutiny, the work would be significant for providing a scalable, principled, and open-source alternative to ad-hoc spatial data creation. This could accelerate progress in spatial intelligence by enabling reproducible high-quality data at 3M scale, with the open-sourcing of the engine and dataset as a clear strength for the community.
Major comments (3)
- [Abstract] The central claim of SOTA performance and a 19% relative improvement is presented without any reference to specific baselines, evaluation benchmarks, protocols, data-quality metrics, or statistical tests. This absence makes the empirical result impossible to assess and is load-bearing for the paper's main contribution.
- [Abstract and data generation description] The data generation process (using perfect 3D bounding boxes with no sensor noise or lighting variation) is described as producing high-fidelity samples, but no ablations or analysis address whether models exploit synthetic artifacts (e.g., viewpoint consistency or exact box primitives) rather than acquiring transferable spatial understanding. This directly impacts the claim of genuine gains on real-world spatial reasoning benchmarks.
- [Evaluation and experiments] The evaluation section provides no details on whether the reported benchmarks are drawn from synthetic or real distributions, nor any cross-distribution transfer experiments. Without this isolation, the 19% improvement cannot be attributed to improved spatial intelligence rather than distribution matching.
Minor comments (2)
- [Introduction] The five task abbreviations (SM, SR, CP, MC, SAR) are introduced without an accompanying summary table, which would improve readability when discussing the data hierarchy.
- [Abstract] The abstract refers to 'versatile models' without specifying model architectures or training details; these should be clarified early to support the performance claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will enhance the transparency and rigor of the empirical claims.
Point-by-point responses
- Referee: [Abstract] The central claim of SOTA performance and a 19% relative improvement is presented without any reference to specific baselines, evaluation benchmarks, protocols, data-quality metrics, or statistical tests. This absence makes the empirical result impossible to assess and is load-bearing for the paper's main contribution.
  Authors: We agree that the abstract would benefit from greater specificity to allow readers to assess the claims. In the revised manuscript, we will expand the abstract to name the primary spatial reasoning benchmarks (e.g., VSR, SpatialVQA, and related suites), the main baselines, the evaluation protocol, and a brief note on statistical validation across runs. These additions will be kept concise while making the 19% relative improvement claim fully traceable. revision: yes
- Referee: [Abstract and data generation description] The data generation process (using perfect 3D bounding boxes with no sensor noise or lighting variation) is described as producing high-fidelity samples, but no ablations or analysis address whether models exploit synthetic artifacts (e.g., viewpoint consistency or exact box primitives) rather than acquiring transferable spatial understanding. This directly impacts the claim of genuine gains on real-world spatial reasoning benchmarks.
  Authors: This concern about potential exploitation of synthetic cues is well-taken. Although the benchmarks contain real-world data and our models demonstrate gains there, we did not previously include targeted ablations for artifact exploitation. In the revision we will add experiments that inject controlled noise into 3D bounding boxes and lighting variations during generation, retrain models, and measure resulting performance changes on the real-world benchmarks (a sketch of such a perturbation appears after these responses). These results will be reported to isolate the contribution of genuine spatial understanding. revision: yes
- Referee: [Evaluation and experiments] The evaluation section provides no details on whether the reported benchmarks are drawn from synthetic or real distributions, nor any cross-distribution transfer experiments. Without this isolation, the 19% improvement cannot be attributed to improved spatial intelligence rather than distribution matching.
  Authors: We acknowledge the need for explicit distribution details and transfer tests. The reported benchmarks are predominantly real-world; however, the evaluation section will be expanded to state the synthetic versus real composition of each benchmark explicitly. We will also add cross-distribution experiments, including training on OpenSpatial-3M and evaluating on held-out real-world subsets (and the reverse), to demonstrate that gains reflect improved spatial reasoning rather than distribution alignment (a bookkeeping sketch of this transfer comparison appears below). revision: yes
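The noise ablation promised in the second response can be sketched as a perturbation applied to the box primitives before sample generation. The paper does not specify a noise model, so the Gaussian center jitter, the multiplicative log-normal size jitter, and the sigma values below are all assumptions made for illustration.

```python
import numpy as np

def perturb_box(center: np.ndarray, size: np.ndarray,
                sigma_center: float = 0.05, sigma_size: float = 0.03,
                rng: np.random.Generator | None = None):
    """Inject controlled noise into a 3D bounding box (meters).

    Additive Gaussian jitter on the center; multiplicative log-normal jitter
    on the size so dimensions stay positive. Sigmas are assumed, not from the paper.
    """
    rng = rng or np.random.default_rng()
    noisy_center = center + rng.normal(0.0, sigma_center, size=3)
    noisy_size = size * np.exp(rng.normal(0.0, sigma_size, size=3))
    return noisy_center, noisy_size

# Ablation loop sketch: regenerate samples at increasing noise levels, retrain,
# and watch whether real-benchmark gains persist (training/eval calls elided).
for sigma in (0.0, 0.02, 0.05, 0.10):
    c, s = perturb_box(np.array([0.6, 0.0, 2.4]), np.array([1.2, 0.75, 0.8]),
                       sigma_center=sigma, rng=np.random.default_rng(0))
    print(f"sigma={sigma:.2f}", np.round(c, 3), np.round(s, 3))
```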
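The cross-distribution experiments in the third response reduce to filling in a small transfer matrix and reading off two numbers: the synthetic-to-real transfer gap and the gain on real benchmarks. A bookkeeping sketch with placeholder scores; the training and evaluation runs that would populate it are elided.

```python
# Transfer matrix sketch: keys are (training distribution, eval distribution).
# Scores are placeholders; real entries would come from retrained models.
results = {
    ("synthetic", "synthetic"): 71.4,
    ("synthetic", "real"):      55.2,  # the cell that matters for the 19% claim
    ("real",      "synthetic"): 49.8,
    ("real",      "real"):      52.1,
}

transfer_gap = results[("synthetic", "synthetic")] - results[("synthetic", "real")]
real_gain = results[("synthetic", "real")] - results[("real", "real")]

print(f"synthetic->real transfer gap: {transfer_gap:.1f} points")
print(f"gain on real benchmarks from synthetic training: {real_gain:+.1f} points")
# Distribution matching, not spatial intelligence, is the worry when the
# transfer gap is large and the real-benchmark gain vanishes.
```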
Circularity Check
No circularity: empirical dataset construction and external benchmark evaluation
Full rationale
The paper presents a data engine that generates samples from 3D bounding boxes for five spatial tasks, curates a 3M dataset, and reports that models trained on it achieve SOTA results with a 19% relative gain on external benchmarks. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain. The central claim rests on empirical training and evaluation against independent benchmarks rather than any self-referential definition or reduction of outputs to inputs by construction. This is a standard non-circular empirical contribution.