pith. machine review for the scientific record.

arxiv: 2604.20570 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Exploring Spatial Intelligence from a Generative Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial intelligence · generative models · multimodal models · image editing · spatial reasoning · synthetic benchmark · fine-tuning · 3D constraints

The pith

Fine-tuning multimodal models on synthetic spatial image edits improves both generation fidelity and downstream spatial understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether current multimodal models possess generative spatial intelligence, the capacity to produce images that correctly respect and manipulate 3D spatial constraints. It builds a benchmark consisting of real-world images and a large synthetic set with precise spatial controls to measure this ability through editing tasks. Experiments demonstrate that training on the synthetic portion produces clear gains on both the synthetic and real test sets. The same training also lifts performance on separate tasks that test spatial comprehension rather than generation. A sympathetic reader would see this as evidence that practicing spatial generation can strengthen overall spatial reasoning in these models.

Core claim

We define generative spatial intelligence as the ability of unified multimodal models to respect 3D spatial constraints while generating or editing images. We introduce GSI-Bench with two parts: GSI-Real, a filtered real-world collection created through 3D-prior-guided pipelines, and GSI-Syn, a large synthetic collection offering controllable spatial operations and automatic labels. A unified evaluation protocol measures spatial compliance and editing accuracy. Fine-tuning models on GSI-Syn produces substantial improvements on both synthetic and real GSI tasks; the same models also score higher on downstream spatial-understanding benchmarks.

What carries the argument

GSI-Bench, a dual real-plus-synthetic benchmark that tests image editing for compliance with explicit 3D spatial constraints and supplies automated, controllable labels for scalable assessment.
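
To make the automated-label idea concrete, here is a minimal Python sketch of what a per-edit spatial-compliance check could look like. The label schema, the tolerance value, and the pose-estimation step are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch of an automated spatial-compliance check, assuming each
# synthetic example carries a label with the object's intended 3D position.

from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class SpatialLabel:
    object_id: str
    target_position: np.ndarray   # intended 3D position (metres) after the edit
    tolerance: float = 0.10       # assumed acceptance radius in metres

def spatial_compliance(edited_image,
                       label: SpatialLabel,
                       estimate_position: Callable) -> bool:
    """An edit complies if the estimated object position lands within tolerance
    of the target recorded in the synthetic label. The position estimator
    (e.g. a depth/pose model) is supplied by the caller; it is a placeholder here."""
    observed = estimate_position(edited_image, label.object_id)
    return bool(np.linalg.norm(np.asarray(observed) - label.target_position)
                <= label.tolerance)

def compliance_rate(edited_images: Sequence,
                    labels: Sequence[SpatialLabel],
                    estimate_position: Callable) -> float:
    """Fraction of edits in a set that satisfy their spatial constraint."""
    checks = [spatial_compliance(img, lab, estimate_position)
              for img, lab in zip(edited_images, labels)]
    return float(np.mean(checks)) if checks else 0.0
```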

If this is right

  • Unified models can acquire stronger spatial capabilities through targeted generative training rather than understanding-only objectives.
  • Large synthetic datasets with explicit spatial controls can transfer to real-world image-editing performance.
  • Improvements in generative spatial compliance also raise scores on separate spatial-reasoning tests.
  • A single training pipeline can advance both image generation fidelity and spatial comprehension within the same model family.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The link between generation practice and understanding gains may extend to other constraint types such as physics or temporal consistency.
  • If the pattern holds, future model development could treat generative spatial exercises as a regular pre-training or alignment step.
  • Benchmarks that isolate spatial constraints could become standard diagnostics for multimodal systems before deployment in robotics or scene-layout applications.

Load-bearing premise

The observed gains after training on the synthetic spatial dataset arise specifically from strengthened spatial awareness rather than from generic adaptation to any new data or from quirks of the data-generation process itself.

What would settle it

A controlled experiment in which models fine-tuned on GSI-Syn show no measurable improvement on GSI-Real editing tasks or on independent spatial-understanding benchmarks, or in which equivalent gains appear after training on matched non-spatial synthetic images.
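
A hedged sketch of how that control could be structured. The trainer, datasets, and evaluation callables below are caller-supplied placeholders, not the paper's implementation; the only point is the shape of the comparison, i.e. identical evaluation for GSI-Syn fine-tuning versus a matched non-spatial corpus.

```python
# Hypothetical outline of the control study described above. All components
# (fine_tune, datasets, evaluation suites) are caller-supplied placeholders.

from typing import Any, Callable, Dict

def run_control_study(base_model: Any,
                      fine_tune: Callable[[Any, Any], Any],
                      gsi_syn: Any,
                      matched_nonspatial: Any,
                      eval_suites: Dict[str, Callable[[Any], float]]
                      ) -> Dict[str, Dict[str, float]]:
    """Return per-suite score deltas (vs. the untuned baseline) for two runs:
    fine-tuning on GSI-Syn and on a size/style-matched non-spatial corpus."""
    baseline = {suite: score(base_model) for suite, score in eval_suites.items()}

    deltas: Dict[str, Dict[str, float]] = {}
    for run_name, train_set in (("gsi_syn", gsi_syn),
                                ("matched_nonspatial", matched_nonspatial)):
        tuned = fine_tune(base_model, train_set)
        deltas[run_name] = {suite: score(tuned) - baseline[suite]
                            for suite, score in eval_suites.items()}

    # The spatial-training reading predicts positive deltas on GSI-Real and
    # spatial-understanding suites for "gsi_syn" that do not reappear under
    # "matched_nonspatial"; comparable deltas in both runs would instead point
    # to generic adaptation.
    return deltas
```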

Figures

Figures reproduced from arXiv: 2604.20570 by Anzhou Li, Chunhua Shen, Hao Chen, Hao Zhong, Huanyi Zheng, Jintao Rong, Kaijun Wang, Muzhi Zhu, Shunyao Jiang, Tao Lin, Yang Liu, Zekai Luo.

Figure 1: We introduce GSI Bench, a benchmark for grounded spatial intelligence that spans both real-world and synthetic scenes. GSI …
Figure 2: Benchmark curation pipeline. The pipeline builds both synthetic (GSI-Syn) and real-world (GSI-Real) benchmarks through unified scene processing, action generation, and validation. For GSI-Syn, scenes are sampled from diverse viewpoints, feasible actions are generated via 3D geometric checks, and a simulator validates outcomes before filtering failures and anomalies. For GSI-Real, clear frames are selected, …
Figure 3: Qualitative comparison of spatial editing results across five instruction types. Rows 1–2 use GSI-Real samples, Rows 3–4 …
Original abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the concept of Generative Spatial Intelligence (GSI) in multimodal large language models, defined as the ability to respect and manipulate 3D spatial constraints during image generation. It presents GSI-Bench, consisting of GSI-Real (a real-world dataset constructed via a 3D-prior-guided generation and filtering pipeline) and GSI-Syn (a large-scale synthetic dataset with controllable spatial operations and automated labeling), along with a unified evaluation protocol for spatial compliance and editing fidelity. The central empirical result is that fine-tuning unified multimodal models on GSI-Syn produces substantial gains on both GSI-Bench components and transfers to improved performance on downstream spatial understanding tasks, providing the first evidence that generative training can strengthen spatial reasoning.

Significance. If the attribution of gains to improved GSI holds after controls, the work would be significant for expanding spatial intelligence research from understanding-only benchmarks to generative capabilities. The introduction of a scalable, model-agnostic benchmark with complementary synthetic and real components is a clear strength, enabling reproducible assessment of spatial editing. The observed transfer from generative fine-tuning to understanding tasks suggests a promising pathway for model improvement, though this requires confirmation that effects are not due to dataset artifacts.

major comments (2)
  1. [GSI-Real construction and Experiments] GSI-Real construction (via 3D-prior-guided pipeline): The central claim that fine-tuning on GSI-Syn improves generative spatial intelligence (rather than general adaptation) is load-bearing on the assumption that gains on GSI-Real reflect internalized 3D constraints. However, GSI-Real is generated and filtered using an explicit 3D-prior pipeline, so models could improve by matching pipeline-specific statistical signatures (e.g., depth-consistent lighting or occlusion patterns) without acquiring transferable spatial reasoning. No control experiments, such as fine-tuning on a matched non-spatial synthetic corpus or evaluating on GSI-Real variants without the 3D prior, are described to isolate this.
  2. [Evaluation protocol and Experiments] Evaluation protocol and results: The unified protocol claims to quantify spatial compliance and editing fidelity in a model-agnostic way, but without explicit details on how GSI-Bench tasks isolate spatial constraints independent of 3D-prior artifacts, the transfer results to downstream spatial understanding tasks cannot unambiguously support the claim of strengthened spatial reasoning. The abstract asserts 'substantial gains' and 'first clear evidence' but the provided description supplies no quantitative metrics, baselines, or error bars to evaluate effect sizes.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief mention of the specific metrics used for 'spatial compliance and editing fidelity' to improve immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about potential dataset artifacts in GSI-Real and the need for clearer quantitative details and isolation of spatial effects are well-taken. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: GSI-Real construction (via 3D-prior-guided pipeline): The central claim that fine-tuning on GSI-Syn improves generative spatial intelligence (rather than general adaptation) is load-bearing on the assumption that gains on GSI-Real reflect internalized 3D constraints. However, GSI-Real is generated and filtered using an explicit 3D-prior pipeline, so models could improve by matching pipeline-specific statistical signatures (e.g., depth-consistent lighting or occlusion patterns) without acquiring transferable spatial reasoning. No control experiments, such as fine-tuning on a matched non-spatial synthetic corpus or evaluating on GSI-Real variants without the 3D prior, are described to isolate this.

    Authors: We agree this is a substantive concern that could undermine attribution of gains specifically to GSI. In the revised manuscript we will add two new control experiments: (1) fine-tuning on a size- and style-matched non-spatial synthetic corpus generated without the 3D prior, and (2) evaluation on GSI-Real variants constructed by ablating the 3D-prior guidance step. These will be reported alongside the original results with the same metrics. If the gains diminish under controls we will revise our interpretation accordingly; preliminary internal checks suggest the spatial-specific improvements remain, but we will let the new data decide. revision: yes

  2. Referee: Evaluation protocol and results: The unified protocol claims to quantify spatial compliance and editing fidelity in a model-agnostic way, but without explicit details on how GSI-Bench tasks isolate spatial constraints independent of 3D-prior artifacts, the transfer results to downstream spatial understanding tasks cannot unambiguously support the claim of strengthened spatial reasoning. The abstract asserts 'substantial gains' and 'first clear evidence' but the provided description supplies no quantitative metrics, baselines, or error bars to evaluate effect sizes.

    Authors: The full manuscript already contains quantitative tables (Tables 2–4) with exact compliance percentages, baseline comparisons (untuned models and prior methods), and standard deviations over three random seeds. To make isolation of spatial constraints explicit we will expand Section 4.2 with a new paragraph detailing prompt design choices that target relational geometry rather than pipeline-specific appearance cues (e.g., lighting, texture). We will also add a short ablation showing that performance drops when spatial relations are randomized while keeping other factors fixed. The abstract claim will be softened to 'provides evidence' pending the new controls. These additions will be placed in the main text rather than appendix. revision: yes
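
As an editorial illustration of the proposed ablation (the example schema and instruction template below are assumptions, not the authors' code): spatial relations are permuted across examples while images and all other prompt text stay fixed, so only the spatial constraint is scrambled.

```python
# Hypothetical sketch of a spatial-relation randomization ablation: shuffle the
# relation term across the evaluation set while holding everything else fixed.

import random
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class EditExample:
    image_path: str
    template: str     # e.g. "move the {obj} {relation} the {anchor}"
    obj: str
    anchor: str
    relation: str     # spatial relation, e.g. "to the left of", "behind"

def build_instruction(ex: EditExample) -> str:
    return ex.template.format(obj=ex.obj, relation=ex.relation, anchor=ex.anchor)

def randomize_relations(examples: List[EditExample], seed: int = 0) -> List[EditExample]:
    """Permute relations across examples; images, objects, and anchors are untouched.
    A model relying on genuine spatial grounding should score markedly worse on the
    permuted set, while appearance-cue shortcuts would be largely unaffected."""
    rng = random.Random(seed)
    relations = [ex.relation for ex in examples]
    rng.shuffle(relations)
    return [replace(ex, relation=rel) for ex, rel in zip(examples, relations)]
```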

Circularity Check

0 steps flagged

Empirical benchmark introduction and fine-tuning experiments with no derivation chain

Full rationale

The paper introduces GSI-Bench (GSI-Real via 3D-prior pipeline and GSI-Syn synthetic data) and reports that fine-tuning unified multimodal models on GSI-Syn produces gains on synthetic, real, and downstream tasks. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. Claims rest on experimental results rather than any self-referential reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract introduces GSI and GSI-Bench without stating free parameters, background axioms, or external benchmarks; the central claim rests on the unverified validity of the new evaluation pipeline.

invented entities (1)
  • Generative Spatial Intelligence (GSI): no independent evidence
    purpose: Defines the target capability of respecting 3D spatial constraints during image generation
    A new term introduced to frame the benchmark and experiments.

pith-pipeline@v0.9.0 · 5527 in / 1133 out tokens · 33881 ms · 2026-05-10T00:42:28.289324+00:00 · methodology

