pith. machine review for the scientific record.

arxiv: 2512.23365 · v3 · submitted 2025-12-29 · 💻 cs.CV

Recognition: no theorem link

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Pith reviewed 2026-05-16 19:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multi-view, spatial reasoning, vision-language models, dataset, partial visibility, occlusion, 3D scene understanding, benchmark

The pith

A new 2-million-pair multi-view dataset trains vision-language models to reason about 3D scenes from partial and occluded views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpatialMosaic, a dataset of 2 million automatically generated QA pairs drawn from multi-view images to teach multimodal models spatial reasoning when only fragments of a scene are visible. It pairs this with SpatialMosaic-Bench, a 1-million-pair evaluation set spanning six tasks in both indoor and outdoor environments, and tests a baseline model that feeds 3D reconstruction outputs into a vision-language model. The central goal is to supply training data that reflects real-world conditions such as occlusion and low overlap without requiring full 3D reconstructions at inference time. If the generated questions faithfully represent these conditions, models should improve at tasks that currently fail when visual cues are incomplete. The work therefore targets the gap between current multi-view training sets and the fragmented observations typical of robotics, surveillance, and augmented-reality applications.

Core claim

A scalable pipeline that renders multi-view images from indoor and outdoor scenes, then automatically produces realistic spatial-reasoning question-answer pairs, yields SpatialMosaic with 2 million training pairs and SpatialMosaic-Bench with 1 million evaluation pairs across six tasks; when used to fine-tune a hybrid SpatialMosaicVLM that inserts 3D geometry encoders into a vision-language model, the data measurably improves performance on partial-visibility, occlusion, and low-overlap spatial reasoning.
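
As a rough illustration of the hybrid design, the Figure 4 caption says geometry and visual tokens are fused via cross-attention and then combined with question tokens. The sketch below is a minimal single-head version under stated assumptions: visual tokens act as queries over geometry tokens, the learned projections are omitted, and all token values are made up. The function names are hypothetical and are not taken from the paper's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(visual_tokens, geometry_tokens):
    """Each visual token (query) attends over all geometry tokens (keys/values).

    Assumption: query/key/value projections are left as identity to keep the
    sketch short; a trained model would learn them.
    """
    dim = len(visual_tokens[0])
    fused = []
    for query in visual_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in geometry_tokens]
        weights = softmax(scores)
        fused.append([sum(w * value[j] for w, value in zip(weights, geometry_tokens))
                      for j in range(dim)])
    return fused

# Toy tokens (made-up numbers): two visual tokens, one geometry token.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
geometry_tokens = [[0.5, 0.5]]
question_tokens = [[0.1, 0.2]]

fused_tokens = cross_attend(visual_tokens, geometry_tokens)
# Fused tokens are concatenated with question tokens before the language model.
llm_input = fused_tokens + question_tokens
```

Which stream supplies the queries, and whether a residual connection is used, is not stated in the material reproduced here, so both choices above are placeholders.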

What carries the argument

The multi-view data generation and annotation pipeline that automatically constructs spatial-reasoning QA pairs capturing partial visibility and low-overlap conditions.
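
The low-overlap part of this pipeline can be sketched from the text accompanying Figure 5, which says candidate frame sets are enumerated and kept only if their internal view overlap stays below a threshold. The sketch below assumes that "internal overlap" means every pairwise overlap in the set; the function name, the overlap table, and the threshold value are hypothetical.

```python
from itertools import combinations

def select_frame_sets(overlap, num_views, max_overlap):
    """Enumerate frame sets of size `num_views` whose pairwise overlaps
    all stay below `max_overlap`.

    overlap: dict mapping frozenset({i, j}) -> overlap ratio in [0, 1]
    """
    frames = sorted({frame for pair in overlap for frame in pair})
    kept = []
    for combo in combinations(frames, num_views):
        if all(overlap[frozenset(pair)] < max_overlap
               for pair in combinations(combo, 2)):
            kept.append(combo)
    return kept

# Toy scene with four frames; the overlap values are made up.
overlap = {
    frozenset({0, 1}): 0.80,
    frozenset({0, 2}): 0.20,
    frozenset({0, 3}): 0.10,
    frozenset({1, 2}): 0.25,
    frozenset({1, 3}): 0.15,
    frozenset({2, 3}): 0.05,
}
sparse_sets = select_frame_sets(overlap, num_views=3, max_overlap=0.3)
print(sparse_sets)  # [(0, 2, 3), (1, 2, 3)]
```

Frames 0 and 1 overlap heavily in this toy table, so every set containing both is rejected. Exhaustive enumeration grows combinatorially with the number of frames, so a real pipeline would likely need sampling or pruning.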

If this is right

  • Training on the dataset improves model accuracy on multi-view spatial tasks that involve occlusion and fragmented views.
  • The accompanying benchmark provides a standardized testbed for comparing methods across six distinct spatial-reasoning skills.
  • Hybrid models that combine 3D geometry encoders with vision-language models become more robust when fine-tuned on the generated pairs.
  • The same pipeline scales to both indoor and outdoor scenes, supporting evaluation in diverse real-world environments.
  • Automatic QA generation removes the need for manual annotation while still producing challenging questions that require 3D inference from 2D cues.
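
To make the automatic QA generation concrete, here is a minimal sketch of filling a counting template and adding distractors. The question wording follows the example shown in the figures; the distractor rule (nearby counts that differ from the answer) and all names are assumptions, not the paper's method.

```python
import random

def build_count_question(category, true_count, rng, num_options=4):
    """Fill a counting template and add numeric distractors (2 to 4 options)."""
    question = f"How many {category}(s) are visible across these frames?"
    # Assumed distractor rule: nearby positive counts that differ from the answer.
    candidates = [c for c in range(max(1, true_count - 3), true_count + 4)
                  if c != true_count]
    options = rng.sample(candidates, num_options - 1) + [true_count]
    rng.shuffle(options)
    letters = "ABCD"[:num_options]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(true_count)],
    }

qa = build_count_question("chair", true_count=4, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which matters when the same pipeline produces both training and benchmark splits.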

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pipeline generalizes, similar automatic generation could be applied to video sequences to add temporal spatial reasoning.
  • The approach suggests that large synthetic multi-view corpora may substitute for explicit 3D supervision in downstream robotics tasks.
  • Low-overlap camera configurations common in drone mapping could benefit directly from models trained on the dataset.
  • Future benchmarks might incorporate adversarial occlusions to test whether performance gains persist under more extreme visibility limits.

Load-bearing premise

Automatically generated question-answer pairs capture real-world partial visibility, occlusion, and low-overlap conditions without systematic biases or annotation errors.

What would settle it

A controlled test in which models trained on SpatialMosaic are evaluated on newly captured real-world multi-view images with known ground-truth occlusions and low overlap; if accuracy gains disappear relative to models trained on existing multi-view datasets, the claim that the generated data is representative would be falsified.

Figures

Figures reproduced from arXiv: 2512.23365 by Injae Lee, Jaesik Park, Jungi Hong, Kanghee Lee, Kwonyoung Ryu, Minseok Kwak.

Figure 1. We present SpatialMosaic, a benchmark designed to evaluate 3D spatial reasoning capabilities from fragmented visual cues across multiple viewpoints. Our benchmark focuses on three challenging real-world scenarios involving partial visibility, occlusion, and low-overlap, where current MLLMs often struggle to maintain geometric and cross-view consistency.
Figure 2. SpatialMosaic data generation pipeline. Given a multi-view image dataset with 3D annotations (Sec. 3.1), we compute object-level and image-level occlusion ratios for each instance. Images are then filtered by overlap to ensure diverse viewpoints, and instances are filtered based on visibility constraints (Sec. 3.2). Finally, spatial relations are computed and used to populate task-specific templates, gener…
Figure 3. Occlusion ratio calculation. We render each instance independently to measure visible (green) and occluded (magenta) pixels. Object Occlusion: Object-level occlusion (r_obj) captures inter-object obstruction from the actual camera view. Field-of-view Occlusion: Field-of-view truncation (r_FoV) uses extended field-of-view rendering to quantify boundary occlusion from frame cropping.
Figure 4. SpatialMosaicVLM architecture. Multi-image inputs are processed through parallel Geometry and Visual Encoders to extract 3D structural and appearance features. The resulting geometry and visual tokens are fused via cross-attention, then combined with question tokens and processed by a Large Language Model to answer spatial reasoning questions under occlusion and partial visibility circumstances.
Figure 5. Difficulty Level Distribution.
  Adjacent paper text: Frame combination construction. For each scene, candidate multi-view combinations are formed by enumerating frame sets that satisfy the required view count and the overlap constraint. Only combinations whose internal view-overlap stays below the specified threshold are retained, ensuring sparse and complementary viewpoints. Valid instance set collection. Within each retaine…
Figure 6. Comparison between SpatialMosaicVLM and InternVL2-8B on SpatialMosaic-Bench.
  Adjacent paper text: These computations are task-specific, including directional separations between instance bounding boxes, visible-pixel statistics, or merged instance counts across views. All computations operate directly on the pre-annotated camera-frame geometry. Answer and distractor generation. The computed values are inserted into the task t…
Figure 7. Object Count example.
  Adjacent example (Task: Best-View Selection). Q) How many chair(s) are visible across these frames? And tell me which frame provides the most informative view of the chair(s)? A: 7 chair(s); Frame 3. B: 4 chair(s); Frame 3. C: 6 chair(s); Frame 4. D: 4 chair(s); Frame 1.
Figure 8. Best-View Selection example.
  Adjacent example (Task: Object Localization). Q) Is there a(n) box in Frame 4? If so, what is the bounding box center coordinates? A: Yes; (197,50). B: No. C: Yes; (353,351). D: Yes; (127,487).
Figure 9. Object Localization example.
Figure 10. Occlusion-Aware Object Existence examples.
Figure 11. Occlusion-Aware Attribute examples.
Figure 12. Occlusion-Aware Spatial Relation examples.
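
The two occlusion ratios named in the Figure 3 caption can be sketched as simple pixel-count fractions. The exact formulas are not given in the material reproduced here, so the definitions below are assumptions: r_FoV is the share of the instance's extended-view pixels that fall outside the frame, and r_obj is the share of its in-frame pixels hidden by other objects. The function and argument names are hypothetical.

```python
def occlusion_ratios(instance_px, visible_px, width, height):
    """Return (r_obj, r_fov) for one instance in one frame.

    instance_px: set of (x, y) pixels covered by the instance when rendered
                 alone in an extended field of view (may lie outside the frame)
    visible_px:  set of (x, y) pixels where the instance is actually visible
                 in the real frame
    """
    in_frame = {(x, y) for x, y in instance_px
                if 0 <= x < width and 0 <= y < height}
    # Share of the instance cropped away by the frame boundary.
    r_fov = 1 - len(in_frame) / len(instance_px) if instance_px else 0.0
    # Share of the in-frame pixels hidden behind other objects.
    occluded = in_frame - visible_px
    r_obj = len(occluded) / len(in_frame) if in_frame else 0.0
    return r_obj, r_fov

# Toy instance: a 4x2 block, half of it left of the image boundary.
instance_px = {(x, y) for x in range(-2, 2) for y in range(2)}
visible_px = {(0, 0), (0, 1), (1, 0)}  # one in-frame pixel is hidden
r_obj, r_fov = occlusion_ratios(instance_px, visible_px, width=640, height=480)
print(r_obj, r_fov)  # 0.25 0.5
```

Keeping the two ratios separate lets a filter distinguish an object hidden behind furniture from one that is merely cut off by the frame edge.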
Original abstract

The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling MLLMs to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under complex and diverse scenarios, consisting of 1M QA pairs across 6 tasks. Our proposed dataset spans both indoor and outdoor scenes, enabling comprehensive evaluation in diverse real-world scenarios. In addition, we introduce a new baseline for multi-view settings, SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and challenging QAs. Code and dataset will be available soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SpatialMosaic, a scalable multi-view data generation pipeline that produces a 2M-pair instruction-tuning dataset of spatial reasoning QA pairs focused on partial visibility, occlusion, and low-overlap conditions across indoor and outdoor scenes. It also releases SpatialMosaic-Bench (1M QA pairs across 6 tasks) and proposes SpatialMosaicVLM, a hybrid VLM that integrates 3D reconstruction models as geometry encoders. The central claim is that the dataset and baseline enhance spatial reasoning under challenging multi-view conditions, as demonstrated by extensive experiments validating the data generation pipeline.

Significance. If the empirical validation holds under independent testing, the work would supply a large-scale resource addressing an under-explored gap in multi-view VLM training for fragmented visual cues, potentially improving robustness without requiring explicit 3D reconstructions. The dataset scale and new baseline could serve as a foundation for future research on real-world spatial reasoning.

major comments (2)
  1. Abstract: the assertion that 'extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions' provides no quantitative metrics, baseline comparisons, error analysis, or protocol details on how partial-visibility cases were constructed or measured, leaving the central empirical claim without verifiable support.
  2. Evaluation setup (implied by abstract and skeptic note): the test distribution in SpatialMosaic-Bench is generated by the identical pipeline used for the 2M training pairs, creating a circularity risk where gains may exploit synthetic cues (e.g., depth noise patterns or QA phrasing) rather than transferable multi-view reasoning; no transfer results on real captures such as ScanNet or Matterport, nor human realism ratings, are reported to anchor the 'realistic' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the assertion that 'extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions' provides no quantitative metrics, baseline comparisons, error analysis, or protocol details on how partial-visibility cases were constructed or measured, leaving the central empirical claim without verifiable support.

    Authors: We agree that the abstract, constrained by length, summarizes the empirical results at a high level without specific numbers or protocol details. The full manuscript provides these elements in Sections 4 and 5, including quantitative gains over baselines, error breakdowns, and the exact construction protocol for partial-visibility, occlusion, and low-overlap cases described in Section 3. To improve verifiability, we will revise the abstract to include key metrics (e.g., accuracy improvements on the six benchmark tasks) while preserving brevity. revision: yes

  2. Referee: Evaluation setup (implied by abstract and skeptic note): the test distribution in SpatialMosaic-Bench is generated by the identical pipeline used for the 2M training pairs, creating a circularity risk where gains may exploit synthetic cues (e.g., depth noise patterns or QA phrasing) rather than transferable multi-view reasoning; no transfer results on real captures such as ScanNet or Matterport, nor human realism ratings, are reported to anchor the 'realistic' claim.

    Authors: We acknowledge the valid concern about potential distribution overlap and synthetic cue exploitation. The benchmark uses disjoint scenes and deliberately varied generation parameters (visibility ratios, overlap thresholds, and noise levels) to reduce this risk, as detailed in Section 3.2. However, the current experiments do not include transfer evaluations on real captures such as ScanNet or Matterport, nor human realism ratings. We will add an explicit limitations subsection discussing this gap and outlining future real-world validation steps. revision: partial
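
The disjoint-scene claim in the second response is straightforward to check once the data is released. The sketch below assumes each QA record carries a scene identifier under a hypothetical `scene_id` key; the record layout and IDs are invented for illustration.

```python
def shared_scenes(train_pairs, bench_pairs):
    """Return the scene IDs that appear in both splits (ideally empty)."""
    train_scenes = {qa["scene_id"] for qa in train_pairs}
    bench_scenes = {qa["scene_id"] for qa in bench_pairs}
    return train_scenes & bench_scenes

# Hypothetical records; only the scene_id field matters here.
train_pairs = [{"scene_id": "scene_a"}, {"scene_id": "scene_b"}]
bench_pairs = [{"scene_id": "scene_c"}, {"scene_id": "scene_b"}]

leaked = shared_scenes(train_pairs, bench_pairs)
print(leaked)  # {'scene_b'}
```

An empty result would rule out scene leakage only. It would not address the referee's broader point that both splits share one generation pipeline and may share template phrasing.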

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivation chain or self-referential reductions.

full rationale

The paper introduces a data generation pipeline to produce the SpatialMosaic dataset (2M QA pairs) and SpatialMosaic-Bench (1M QA pairs), then reports empirical gains from training SpatialMosaicVLM on the data. No equations, fitted parameters, uniqueness theorems, or ansatzes are defined in terms of the target results. Claims rest on the new artifacts and standard train/test splits rather than any step that reduces by construction to its own inputs. The shared generation pipeline between train and test data is a standard empirical setup for synthetic benchmarks and does not meet the criteria for circularity (no quoted self-definition or load-bearing self-citation that collapses the central claim).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the data-generation pipeline produces high-quality, realistic spatial reasoning examples; no numerical free parameters are fitted and no new physical entities are postulated.

axioms (1)
  • domain assumption: The automated multi-view data generation and annotation pipeline produces realistic and challenging spatial reasoning QAs that match real-world partial-visibility conditions.
    Invoked when describing the pipeline and when claiming experimental validation of enhanced spatial reasoning.

pith-pipeline@v0.9.0 · 5568 in / 1286 out tokens · 45954 ms · 2026-05-16T19:40:06.997331+00:00 · methodology


Supplementary material (excerpt)

A. Statistics of SpatialMosaic-Bench. We provide detailed statistics of SpatialMosaic-Bench across different difficulty levels in Fig. 5. Our benchmark contains total 1M QA pairs distributed across six main task categories: Count, Best-View Selection, Existence, Attribu...