Exploring Spatial Intelligence from a Generative Perspective
Pith reviewed 2026-05-10 00:42 UTC · model grok-4.3
The pith
Fine-tuning multimodal models on synthetic spatial image edits improves both generation fidelity and downstream spatial understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We define generative spatial intelligence as the ability of unified multimodal models to respect 3D spatial constraints while generating or editing images. We introduce GSI-Bench with two parts: GSI-Real, a filtered real-world collection created through 3D-prior-guided pipelines, and GSI-Syn, a large synthetic collection offering controllable spatial operations and automatic labels. A unified evaluation protocol measures spatial compliance and editing accuracy. Fine-tuning models on GSI-Syn produces substantial improvements on both synthetic and real GSI tasks; the same models also score higher on downstream spatial-understanding benchmarks.
What carries the argument
GSI-Bench, a dual real-plus-synthetic benchmark that tests image editing for compliance with explicit 3D spatial constraints and supplies automated, controllable labels for scalable assessment.
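The text quoted here does not spell out how spatial compliance is scored, but a minimal form of such a check can be sketched. The sketch below assumes a depth map (e.g. from a monocular depth estimator) and per-object masks are available; `complies_behind` and its `margin` parameter are hypothetical names for illustration, not the paper's API.

```python
import numpy as np

def median_depth(depth, mask):
    """Median depth over a boolean object mask."""
    return float(np.median(depth[mask]))

def complies_behind(depth_edit, mask_a, mask_b, margin=0.0):
    """Check the constraint 'object A is behind object B' in the edited
    image: A's median depth must exceed B's by `margin`.
    Convention: larger depth value = farther from the camera."""
    return median_depth(depth_edit, mask_a) > median_depth(depth_edit, mask_b) + margin

# Toy 4x4 depth map: A occupies the left half at depth ~5,
# B the right half at depth ~2, so A sits behind B.
depth = np.array([[5.0, 5.0, 2.0, 2.0]] * 4)
mask_a = np.zeros((4, 4), bool); mask_a[:, :2] = True
mask_b = np.zeros((4, 4), bool); mask_b[:, 2:] = True
print(complies_behind(depth, mask_a, mask_b))  # True
```

A full protocol would aggregate many such per-constraint checks (ordering, distance, occlusion) into a compliance rate, alongside a separate fidelity score for the unedited regions.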
If this is right
- Unified models can acquire stronger spatial capabilities through targeted generative training rather than understanding-only objectives.
- Large synthetic datasets with explicit spatial controls can transfer to real-world image-editing performance.
- Improvements in generative spatial compliance also raise scores on separate spatial-reasoning tests.
- A single training pipeline can advance both image generation fidelity and spatial comprehension within the same model family.
Where Pith is reading between the lines
- The link between generation practice and understanding gains may extend to other constraint types such as physics or temporal consistency.
- If the pattern holds, future model development could treat generative spatial exercises as a regular pre-training or alignment step.
- Benchmarks that isolate spatial constraints could become standard diagnostics for multimodal systems before deployment in robotics or scene-layout applications.
Load-bearing premise
The observed gains after training on the synthetic spatial dataset arise specifically from strengthened spatial awareness rather than from generic adaptation to any new data or from quirks of the data-generation process itself.
What would settle it
A controlled experiment in which models fine-tuned on GSI-Syn show no measurable improvement on GSI-Real editing tasks or on independent spatial-understanding benchmarks, or in which equivalent gains appear after training on matched non-spatial synthetic images.
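Such a control comparison could be analyzed with a one-sided permutation test on per-example gains (spatial fine-tuning vs. matched non-spatial fine-tuning). The sketch below is illustrative only; the function name and data are assumptions, not the paper's protocol.

```python
import random

def permutation_test(gains_spatial, gains_control, n_perm=10000, seed=0):
    """One-sided permutation test: is the mean per-example gain after
    spatial fine-tuning larger than after matched non-spatial fine-tuning?
    Returns an estimated p-value; small values favor a spatial-specific effect."""
    rng = random.Random(seed)
    observed = (sum(gains_spatial) / len(gains_spatial)
                - sum(gains_control) / len(gains_control))
    pooled = gains_spatial + gains_control
    n = len(gains_spatial)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel examples at random
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / len(pooled[n:])
        if diff >= observed:
            hits += 1
    return hits / n_perm
```

A large p-value under this test would indicate the second failure mode above: equivalent gains from matched non-spatial data.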
Original abstract
Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the concept of Generative Spatial Intelligence (GSI) in multimodal large language models, defined as the ability to respect and manipulate 3D spatial constraints during image generation. It presents GSI-Bench, consisting of GSI-Real (a real-world dataset constructed via a 3D-prior-guided generation and filtering pipeline) and GSI-Syn (a large-scale synthetic dataset with controllable spatial operations and automated labeling), along with a unified evaluation protocol for spatial compliance and editing fidelity. The central empirical result is that fine-tuning unified multimodal models on GSI-Syn produces substantial gains on both GSI-Bench components and transfers to improved performance on downstream spatial understanding tasks, providing the first evidence that generative training can strengthen spatial reasoning.
Significance. If the attribution of gains to improved GSI holds after controls, the work would be significant for expanding spatial intelligence research from understanding-only benchmarks to generative capabilities. The introduction of a scalable, model-agnostic benchmark with complementary synthetic and real components is a clear strength, enabling reproducible assessment of spatial editing. The observed transfer from generative fine-tuning to understanding tasks suggests a promising pathway for model improvement, though this requires confirmation that effects are not due to dataset artifacts.
Major comments (2)
- [GSI-Real construction and Experiments] GSI-Real construction (via 3D-prior-guided pipeline): The central claim that fine-tuning on GSI-Syn improves generative spatial intelligence (rather than general adaptation) is load-bearing on the assumption that gains on GSI-Real reflect internalized 3D constraints. However, GSI-Real is generated and filtered using an explicit 3D-prior pipeline, so models could improve by matching pipeline-specific statistical signatures (e.g., depth-consistent lighting or occlusion patterns) without acquiring transferable spatial reasoning. No control experiments, such as fine-tuning on a matched non-spatial synthetic corpus or evaluating on GSI-Real variants without the 3D prior, are described to isolate this.
- [Evaluation protocol and Experiments] Evaluation protocol and results: The unified protocol claims to quantify spatial compliance and editing fidelity in a model-agnostic way, but without explicit details on how GSI-Bench tasks isolate spatial constraints independent of 3D-prior artifacts, the transfer results to downstream spatial understanding tasks cannot unambiguously support the claim of strengthened spatial reasoning. The abstract asserts 'substantial gains' and 'first clear evidence' but the provided description supplies no quantitative metrics, baselines, or error bars to evaluate effect sizes.
Minor comments (1)
- [Abstract] The abstract would benefit from a brief mention of the specific metrics used for 'spatial compliance and editing fidelity' to improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concerns about potential dataset artifacts in GSI-Real and the need for clearer quantitative details and isolation of spatial effects are well-taken. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: GSI-Real construction (via 3D-prior-guided pipeline): The central claim that fine-tuning on GSI-Syn improves generative spatial intelligence (rather than general adaptation) is load-bearing on the assumption that gains on GSI-Real reflect internalized 3D constraints. However, GSI-Real is generated and filtered using an explicit 3D-prior pipeline, so models could improve by matching pipeline-specific statistical signatures (e.g., depth-consistent lighting or occlusion patterns) without acquiring transferable spatial reasoning. No control experiments, such as fine-tuning on a matched non-spatial synthetic corpus or evaluating on GSI-Real variants without the 3D prior, are described to isolate this.
Authors: We agree this is a substantive concern that could undermine attribution of gains specifically to GSI. In the revised manuscript we will add two new control experiments: (1) fine-tuning on a size- and style-matched non-spatial synthetic corpus generated without the 3D prior, and (2) evaluation on GSI-Real variants constructed by ablating the 3D-prior guidance step. These will be reported alongside the original results with the same metrics. If the gains diminish under controls we will revise our interpretation accordingly; preliminary internal checks suggest the spatial-specific improvements remain, but we will let the new data decide.
Revision: yes
Referee: Evaluation protocol and results: The unified protocol claims to quantify spatial compliance and editing fidelity in a model-agnostic way, but without explicit details on how GSI-Bench tasks isolate spatial constraints independent of 3D-prior artifacts, the transfer results to downstream spatial understanding tasks cannot unambiguously support the claim of strengthened spatial reasoning. The abstract asserts 'substantial gains' and 'first clear evidence' but the provided description supplies no quantitative metrics, baselines, or error bars to evaluate effect sizes.
Authors: The full manuscript already contains quantitative tables (Tables 2–4) with exact compliance percentages, baseline comparisons (untuned models and prior methods), and standard deviations over three random seeds. To make the isolation of spatial constraints explicit we will expand Section 4.2 with a new paragraph detailing prompt design choices that target relational geometry rather than pipeline-specific appearance cues (e.g., lighting, texture). We will also add a short ablation showing that performance drops when spatial relations are randomized while all other factors are kept fixed. The abstract claim will be softened to 'provides evidence' pending the new controls. These additions will be placed in the main text rather than the appendix.
Revision: yes
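The relation-randomization ablation the authors describe could take roughly the form below; the relation vocabulary and function name are illustrative assumptions, not taken from the paper.

```python
import random

# Illustrative relation vocabulary; the paper's actual operation set is not given here.
RELATIONS = ["left of", "right of", "behind", "in front of", "above", "below"]

def randomize_relation(instruction, rng):
    """Ablation probe: replace the first spatial relation found in an
    editing instruction with a different randomly chosen one, keeping all
    other wording fixed. If the model truly reads the relation, compliance
    against the original constraint should drop toward chance."""
    for rel in RELATIONS:
        if rel in instruction:
            alt = rng.choice([r for r in RELATIONS if r != rel])
            return instruction.replace(rel, alt, 1)
    return instruction  # no known relation found; leave unchanged

rng = random.Random(0)
print(randomize_relation("place the lamp in front of the sofa", rng))
```

Because only the relation token changes, any residual performance isolates appearance cues from relational geometry, which is exactly what the referee asked the protocol to demonstrate.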
Circularity Check
Empirical benchmark introduction and fine-tuning experiments with no derivation chain
Full rationale
The paper introduces GSI-Bench (GSI-Real via 3D-prior pipeline and GSI-Syn synthetic data) and reports that fine-tuning unified multimodal models on GSI-Syn produces gains on synthetic, real, and downstream tasks. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. Claims rest on experimental results rather than any self-referential reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Generative Spatial Intelligence (GSI): no independent evidence