pith. machine review for the scientific record.

arxiv: 2604.20570 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Exploring Spatial Intelligence from a Generative Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial intelligence · generative models · multimodal models · image editing · spatial reasoning · synthetic benchmark · fine-tuning · 3D constraints

The pith

Fine-tuning multimodal models on synthetic spatial image edits improves both generation fidelity and downstream spatial understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether current multimodal models possess generative spatial intelligence, the capacity to produce images that correctly respect and manipulate 3D spatial constraints. It builds a benchmark consisting of real-world images and a large synthetic set with precise spatial controls to measure this ability through editing tasks. Experiments demonstrate that training on the synthetic portion produces clear gains on both the synthetic and real test sets. The same training also lifts performance on separate tasks that test spatial comprehension rather than generation. A sympathetic reader would see this as evidence that practicing spatial generation can strengthen overall spatial reasoning in these models.

Core claim

We define generative spatial intelligence as the ability of unified multimodal models to respect 3D spatial constraints while generating or editing images. We introduce GSI-Bench with two parts: GSI-Real, a filtered real-world collection created through 3D-prior-guided pipelines, and GSI-Syn, a large synthetic collection offering controllable spatial operations and automatic labels. A unified evaluation protocol measures spatial compliance and editing accuracy. Fine-tuning models on GSI-Syn produces substantial improvements on both synthetic and real GSI tasks; the same models also score higher on downstream spatial-understanding benchmarks.

What carries the argument

GSI-Bench, a dual real-plus-synthetic benchmark that tests image editing for compliance with explicit 3D spatial constraints and supplies automated, controllable labels for scalable assessment.
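
To make the automated-label idea concrete, here is a minimal Python sketch of what a per-edit spatial-compliance check could look like. The label schema, the tolerance value, and the pose-estimation step are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch of an automated spatial-compliance check, assuming each
# synthetic example carries a label with the object's intended 3D position.

from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class SpatialLabel:
    object_id: str
    target_position: np.ndarray   # intended 3D position (metres) after the edit
    tolerance: float = 0.10       # assumed acceptance radius in metres

def spatial_compliance(edited_image,
                       label: SpatialLabel,
                       estimate_position: Callable) -> bool:
    """An edit complies if the estimated object position lands within tolerance
    of the target recorded in the synthetic label. The position estimator
    (e.g. a depth/pose model) is supplied by the caller; it is a placeholder here."""
    observed = estimate_position(edited_image, label.object_id)
    return bool(np.linalg.norm(np.asarray(observed) - label.target_position)
                <= label.tolerance)

def compliance_rate(edited_images: Sequence,
                    labels: Sequence[SpatialLabel],
                    estimate_position: Callable) -> float:
    """Fraction of edits in a set that satisfy their spatial constraint."""
    checks = [spatial_compliance(img, lab, estimate_position)
              for img, lab in zip(edited_images, labels)]
    return float(np.mean(checks)) if checks else 0.0
```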

If this is right

  • Unified models can acquire stronger spatial capabilities through targeted generative training rather than understanding-only objectives.
  • Large synthetic datasets with explicit spatial controls can transfer to real-world image-editing performance.
  • Improvements in generative spatial compliance also raise scores on separate spatial-reasoning tests.
  • A single training pipeline can advance both image generation fidelity and spatial comprehension within the same model family.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The link between generation practice and understanding gains may extend to other constraint types such as physics or temporal consistency.
  • If the pattern holds, future model development could treat generative spatial exercises as a regular pre-training or alignment step.
  • Benchmarks that isolate spatial constraints could become standard diagnostics for multimodal systems before deployment in robotics or scene-layout applications.

Load-bearing premise

The observed gains after training on the synthetic spatial dataset arise specifically from strengthened spatial awareness rather than from generic adaptation to any new data or from quirks of the data-generation process itself.

What would settle it

A controlled experiment in which models fine-tuned on GSI-Syn show no measurable improvement on GSI-Real editing tasks or on independent spatial-understanding benchmarks, or in which equivalent gains appear after training on matched non-spatial synthetic images.
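
A hedged sketch of how that control could be structured. The trainer, datasets, and evaluation callables below are caller-supplied placeholders, not the paper's implementation; the only point is the shape of the comparison, i.e. identical evaluation for GSI-Syn fine-tuning versus a matched non-spatial corpus.

```python
# Hypothetical outline of the control study described above. All components
# (fine_tune, datasets, evaluation suites) are caller-supplied placeholders.

from typing import Any, Callable, Dict

def run_control_study(base_model: Any,
                      fine_tune: Callable[[Any, Any], Any],
                      gsi_syn: Any,
                      matched_nonspatial: Any,
                      eval_suites: Dict[str, Callable[[Any], float]]
                      ) -> Dict[str, Dict[str, float]]:
    """Return per-suite score deltas (vs. the untuned baseline) for two runs:
    fine-tuning on GSI-Syn and on a size/style-matched non-spatial corpus."""
    baseline = {suite: score(base_model) for suite, score in eval_suites.items()}

    deltas: Dict[str, Dict[str, float]] = {}
    for run_name, train_set in (("gsi_syn", gsi_syn),
                                ("matched_nonspatial", matched_nonspatial)):
        tuned = fine_tune(base_model, train_set)
        deltas[run_name] = {suite: score(tuned) - baseline[suite]
                            for suite, score in eval_suites.items()}

    # The spatial-training reading predicts positive deltas on GSI-Real and
    # spatial-understanding suites for "gsi_syn" that do not reappear under
    # "matched_nonspatial"; comparable deltas in both runs would instead point
    # to generic adaptation.
    return deltas
```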

Figures

Figures reproduced from arXiv: 2604.20570 by Anzhou Li, Chunhua Shen, Hao Chen, Hao Zhong, Huanyi Zheng, Jintao Rong, Kaijun Wang, Muzhi Zhu, Shunyao Jiang, Tao Lin, Yang Liu, Zekai Luo.

Figure 1: We introduce GSI Bench, a benchmark for grounded spatial intelligence that spans both real-world and synthetic scenes. GSI …
Figure 2: Benchmark curation pipeline. The pipeline builds both synthetic (GSI-Syn) and real-world (GSI-Real) benchmarks through unified scene processing, action generation, and validation. For GSI-Syn, scenes are sampled from diverse viewpoints, feasible actions are generated via 3D geometric checks, and a simulator validates outcomes before filtering failures and anomalies. For GSI-Real, clear frames are selected, …
Figure 3: Qualitative comparison of spatial editing results across five instruction types. Rows 1–2 use GSI-Real samples, Rows 3–4 …
Original abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the concept of Generative Spatial Intelligence (GSI) in multimodal large language models, defined as the ability to respect and manipulate 3D spatial constraints during image generation. It presents GSI-Bench, consisting of GSI-Real (a real-world dataset constructed via a 3D-prior-guided generation and filtering pipeline) and GSI-Syn (a large-scale synthetic dataset with controllable spatial operations and automated labeling), along with a unified evaluation protocol for spatial compliance and editing fidelity. The central empirical result is that fine-tuning unified multimodal models on GSI-Syn produces substantial gains on both GSI-Bench components and transfers to improved performance on downstream spatial understanding tasks, providing the first evidence that generative training can strengthen spatial reasoning.

Significance. If the attribution of gains to improved GSI holds after controls, the work would be significant for expanding spatial intelligence research from understanding-only benchmarks to generative capabilities. The introduction of a scalable, model-agnostic benchmark with complementary synthetic and real components is a clear strength, enabling reproducible assessment of spatial editing. The observed transfer from generative fine-tuning to understanding tasks suggests a promising pathway for model improvement, though this requires confirmation that effects are not due to dataset artifacts.

major comments (2)
  1. [GSI-Real construction and Experiments] GSI-Real construction (via 3D-prior-guided pipeline): The central claim that fine-tuning on GSI-Syn improves generative spatial intelligence (rather than general adaptation) is load-bearing on the assumption that gains on GSI-Real reflect internalized 3D constraints. However, GSI-Real is generated and filtered using an explicit 3D-prior pipeline, so models could improve by matching pipeline-specific statistical signatures (e.g., depth-consistent lighting or occlusion patterns) without acquiring transferable spatial reasoning. No control experiments, such as fine-tuning on a matched non-spatial synthetic corpus or evaluating on GSI-Real variants without the 3D prior, are described to isolate this.
  2. [Evaluation protocol and Experiments] Evaluation protocol and results: The unified protocol claims to quantify spatial compliance and editing fidelity in a model-agnostic way, but without explicit details on how GSI-Bench tasks isolate spatial constraints independent of 3D-prior artifacts, the transfer results to downstream spatial understanding tasks cannot unambiguously support the claim of strengthened spatial reasoning. The abstract asserts 'substantial gains' and 'first clear evidence' but the provided description supplies no quantitative metrics, baselines, or error bars to evaluate effect sizes.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief mention of the specific metrics used for 'spatial compliance and editing fidelity' to improve immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about potential dataset artifacts in GSI-Real and the need for clearer quantitative details and isolation of spatial effects are well-taken. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: GSI-Real construction (via 3D-prior-guided pipeline): The central claim that fine-tuning on GSI-Syn improves generative spatial intelligence (rather than general adaptation) is load-bearing on the assumption that gains on GSI-Real reflect internalized 3D constraints. However, GSI-Real is generated and filtered using an explicit 3D-prior pipeline, so models could improve by matching pipeline-specific statistical signatures (e.g., depth-consistent lighting or occlusion patterns) without acquiring transferable spatial reasoning. No control experiments, such as fine-tuning on a matched non-spatial synthetic corpus or evaluating on GSI-Real variants without the 3D prior, are described to isolate this.

    Authors: We agree this is a substantive concern that could undermine attribution of gains specifically to GSI. In the revised manuscript we will add two new control experiments: (1) fine-tuning on a size- and style-matched non-spatial synthetic corpus generated without the 3D prior, and (2) evaluation on GSI-Real variants constructed by ablating the 3D-prior guidance step. These will be reported alongside the original results with the same metrics. If the gains diminish under controls we will revise our interpretation accordingly; preliminary internal checks suggest the spatial-specific improvements remain, but we will let the new data decide. revision: yes

  2. Referee: Evaluation protocol and results: The unified protocol claims to quantify spatial compliance and editing fidelity in a model-agnostic way, but without explicit details on how GSI-Bench tasks isolate spatial constraints independent of 3D-prior artifacts, the transfer results to downstream spatial understanding tasks cannot unambiguously support the claim of strengthened spatial reasoning. The abstract asserts 'substantial gains' and 'first clear evidence' but the provided description supplies no quantitative metrics, baselines, or error bars to evaluate effect sizes.

    Authors: The full manuscript already contains quantitative tables (Tables 2–4) with exact compliance percentages, baseline comparisons (untuned models and prior methods), and standard deviations over three random seeds. To make isolation of spatial constraints explicit we will expand Section 4.2 with a new paragraph detailing prompt design choices that target relational geometry rather than pipeline-specific appearance cues (e.g., lighting, texture). We will also add a short ablation showing that performance drops when spatial relations are randomized while keeping other factors fixed. The abstract claim will be softened to 'provides evidence' pending the new controls. These additions will be placed in the main text rather than appendix. revision: yes
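
As an editorial illustration of the proposed ablation (the example schema and instruction template below are assumptions, not the authors' code): spatial relations are permuted across examples while images and all other prompt text stay fixed, so only the spatial constraint is scrambled.

```python
# Hypothetical sketch of a spatial-relation randomization ablation: shuffle the
# relation term across the evaluation set while holding everything else fixed.

import random
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class EditExample:
    image_path: str
    template: str     # e.g. "move the {obj} {relation} the {anchor}"
    obj: str
    anchor: str
    relation: str     # spatial relation, e.g. "to the left of", "behind"

def build_instruction(ex: EditExample) -> str:
    return ex.template.format(obj=ex.obj, relation=ex.relation, anchor=ex.anchor)

def randomize_relations(examples: List[EditExample], seed: int = 0) -> List[EditExample]:
    """Permute relations across examples; images, objects, and anchors are untouched.
    A model relying on genuine spatial grounding should score markedly worse on the
    permuted set, while appearance-cue shortcuts would be largely unaffected."""
    rng = random.Random(seed)
    relations = [ex.relation for ex in examples]
    rng.shuffle(relations)
    return [replace(ex, relation=rel) for ex, rel in zip(examples, relations)]
```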

Circularity Check

0 steps flagged

Empirical benchmark introduction and fine-tuning experiments with no derivation chain

Full rationale

The paper introduces GSI-Bench (GSI-Real via 3D-prior pipeline and GSI-Syn synthetic data) and reports that fine-tuning unified multimodal models on GSI-Syn produces gains on synthetic, real, and downstream tasks. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. Claims rest on experimental results rather than any self-referential reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract introduces GSI and GSI-Bench without stating free parameters, background axioms, or external benchmarks; the central claim rests on the unverified validity of the new evaluation pipeline.

invented entities (1)
  • Generative Spatial Intelligence (GSI): no independent evidence
    purpose: Defines the target capability of respecting 3D spatial constraints during image generation
    A new term introduced to frame the benchmark and experiments.

pith-pipeline@v0.9.0 · 5527 in / 1133 out tokens · 33881 ms · 2026-05-10T00:42:28.289324+00:00 · methodology

