arxiv: 2512.23365 · v3 · submitted 2025-12-29 · 💻 cs.CV

Recognition: no theorem link

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Kanghee Lee , Injae Lee , Minseok Kwak , Jungi Hong , Kwonyoung Ryu , Jaesik Park

Authors on Pith no claims yet

Pith reviewed 2026-05-16 19:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-viewspatial reasoningvision-language modelsdatasetpartial visibilityocclusion3D scene understandingbenchmark

0 comments

The pith

A new 2-million-pair multi-view dataset trains vision-language models to reason about 3D scenes from partial and occluded views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpatialMosaic, a dataset of 2 million automatically generated QA pairs drawn from multi-view images to teach multimodal models spatial reasoning when only fragments of a scene are visible. It pairs this with SpatialMosaic-Bench, a 1-million-pair evaluation set spanning six tasks in both indoor and outdoor environments, and tests a baseline model that feeds 3D reconstruction outputs into a vision-language model. The central goal is to supply training data that reflects real-world conditions such as occlusion and low overlap without requiring full 3D reconstructions at inference time. If the generated questions faithfully represent these conditions, models should improve at tasks that currently fail when visual cues are incomplete. The work therefore targets the gap between current multi-view training sets and the fragmented observations typical of robotics, surveillance, and augmented-reality applications.

Core claim

A scalable pipeline that renders multi-view images from indoor and outdoor scenes, then automatically produces realistic spatial-reasoning question-answer pairs, yields SpatialMosaic with 2 million training pairs and SpatialMosaic-Bench with 1 million evaluation pairs across six tasks; when used to fine-tune a hybrid SpatialMosaicVLM that inserts 3D geometry encoders into a vision-language model, the data measurably improves performance on partial-visibility, occlusion, and low-overlap spatial reasoning.

What carries the argument

The multi-view data generation and annotation pipeline that automatically constructs spatial-reasoning QA pairs capturing partial visibility and low-overlap conditions.

If this is right

Training on the dataset improves model accuracy on multi-view spatial tasks that involve occlusion and fragmented views.
The accompanying benchmark provides a standardized testbed for comparing methods across six distinct spatial-reasoning skills.
Hybrid models that combine 3D geometry encoders with vision-language models become more robust when fine-tuned on the generated pairs.
The same pipeline scales to both indoor and outdoor scenes, supporting evaluation in diverse real-world environments.
Automatic QA generation removes the need for manual annotation while still producing challenging questions that require 3D inference from 2D cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pipeline generalizes, similar automatic generation could be applied to video sequences to add temporal spatial reasoning.
The approach suggests that large synthetic multi-view corpora may substitute for explicit 3D supervision in downstream robotics tasks.
Low-overlap camera configurations common in drone mapping could benefit directly from models trained on the dataset.
Future benchmarks might incorporate adversarial occlusions to test whether performance gains persist under more extreme visibility limits.

Load-bearing premise

Automatically generated question-answer pairs capture real-world partial visibility, occlusion, and low-overlap conditions without systematic biases or annotation errors.

What would settle it

A controlled test in which models trained on SpatialMosaic are evaluated on newly captured real-world multi-view images with known ground-truth occlusions and low overlap; if accuracy gains disappear relative to models trained on existing multi-view datasets, the claim that the generated data is representative would be falsified.

Figures

Figures reproduced from arXiv: 2512.23365 by Injae Lee, Jaesik Park, Jungi Hong, Kanghee Lee, Kwonyoung Ryu, Minseok Kwak.

**Figure 1.** Figure 1: We present SpatialMosaic, a benchmark designed to evaluate 3D spatial reasoning capabilities from fragmented visual cues across multiple viewpoints. Our benchmark focuses on three challenging real-world scenarios involving partial visibility, occlusion, and low-overlap, where current MLLMs often struggle to maintain geometric and cross-view consistency. Abstract The rapid progress of Multimodal Large Langu… view at source ↗

**Figure 2.** Figure 2: SpatialMosaic data generation pipeline. Given multi-view images dataset, with 3D annotations (Sec. 3.1), we compute object-level and image-level occlusion ratios for each instance. Images are then filtered by overlap to ensure diverse viewpoints, and instances are filtered based on visibility constraints (Sec. 3.2). Finally, spatial relations are computed and used to populate task-specific templates, gener… view at source ↗

**Figure 3.** Figure 3: Occlusion ratio calculation. We render each instance independently to measure visible (green) and occluded (magenta) pixels. Object Occlusion: Object-level occlusion (robj) captures inter-object obstruction from the actual camera view. Field-of-view Occlusion: Field-of-view truncation (rFoV) uses extended field-of-view rendering to quantify boundary occlusion from frame cropping. of the extended field. … view at source ↗

**Figure 4.** Figure 4: SpatialMosaicVLM architecture. Multi-image inputs are processed through parallel Geometry and Visual Encoders to extract 3D structural and appearance features. The resulting geometry and visual tokens are fused via crossattention, then combined with question tokens and processed by a Large Language Model to answer spatial reasoning questions under occlusion and partial visibility circumstances. visual an… view at source ↗

**Figure 5.** Figure 5: Difficulty Level Distribution Frame combination construction. For each scene, candidate multi-view combinations are formed by enumerating frame sets that satisfy the required view count and the overlap constraint. Only combinations whose internal viewoverlap stays below the specified threshold are retained, ensuring sparse and complementary viewpoints. Valid instance set collection. Within each retaine… view at source ↗

**Figure 6.** Figure 6: Comparison between SpatialMosaicVLM and InternVL2-8B on SpatialMosaic-Bench. These computations are task-specific, including directional separations between instance bounding boxes, visible-pixel statistics, or merged instance counts across views. All computations operate directly on the pre-annotated camera-frame geometry. Answer and distractor generation. The computed values are inserted into the task t… view at source ↗

**Figure 7.** Figure 7: Object Count example Task: Best-View Selection Q) How many chair(s) are visible across these frames? And tell me which frame provides the most informative view of the chair(s)? A: 7 chair(s); Frame 3 B: 4 chair(s); Frame 3 C: 6 chair(s); Frame 4 D: 4 chair(s); Frame 1 Frame 1 Frame 2 Frame 3 Frame 4 Frame 5 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Best-View Selection example Task: Object Localization Q) Is there a(n) box in Frame 4? If so, what is the bounding box center coordinates? A: Yes; (197,50) B: No C: Yes; (353,351) D: Yes; (127,487) Frame 1 Frame 2 Frame 3 Frame 4 Frame 5 [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Object Localization example 8 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Occlusion-Aware Object Existence examples [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Occlusion-Aware Attribute examples 10 [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Occlusion-Aware Spatial Relation examples [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

read the original abstract

The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling MLLMs to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under complex and diverse scenarios, consisting of 1M QA pairs across 6 tasks. Our proposed dataset spans both indoor and outdoor scenes, enabling comprehensive evaluation in diverse real-world scenarios. In addition, we introduce a new baseline for multi-view settings, SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and challenging QAs. Code and dataset will be available soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpatialMosaic supplies a sizable new synthetic dataset and benchmark for partial-visibility multi-view reasoning, but the validation stays thin on numbers and real-world checks.

read the letter

The paper's core offering is a 2M-pair instruction dataset plus a 1M-pair benchmark across six tasks, all built around multi-view images with heavy occlusion and low overlap. They also release a hybrid baseline that feeds 3D reconstruction features into a VLM. This targets a practical hole: most existing multi-view VLM work assumes decent view coverage, while real scenes often force reasoning from fragments. The scale and indoor/outdoor mix are straightforward wins for anyone who needs training data in this regime. The hybrid encoder idea is a reasonable first baseline and avoids starting from scratch on geometry. The soft spots sit in the evaluation. The abstract states that experiments show the dataset improves spatial reasoning and that the generation pipeline works, yet no metrics, tables, or baseline comparisons appear. If the test distribution is produced by the same synthetic pipeline as the training data, reported gains could come from matching generation artifacts rather than learning transferable cues. No human realism ratings or transfer results on real captures like ScanNet are mentioned, which leaves the central claim under-anchored. The citation pattern is normal for a dataset release and does not raise flags. This paper is for groups building or evaluating MLLMs on 3D spatial tasks who need data that stresses partial visibility. A reader looking for a ready benchmark or starting point for fine-tuning would find it useful. It deserves peer review because the resource itself is large and the problem matters, even if the current experiments need more concrete evidence and external validation to hold up.

Referee Report

2 major / 0 minor

Summary. The paper introduces SpatialMosaic, a scalable multi-view data generation pipeline that produces a 2M-pair instruction-tuning dataset of spatial reasoning QA pairs focused on partial visibility, occlusion, and low-overlap conditions across indoor and outdoor scenes. It also releases SpatialMosaic-Bench (1M QA pairs across 6 tasks) and proposes SpatialMosaicVLM, a hybrid VLM that integrates 3D reconstruction models as geometry encoders. The central claim is that the dataset and baseline enhance spatial reasoning under challenging multi-view conditions, as demonstrated by extensive experiments validating the data generation pipeline.

Significance. If the empirical validation holds under independent testing, the work would supply a large-scale resource addressing an under-explored gap in multi-view VLM training for fragmented visual cues, potentially improving robustness without requiring explicit 3D reconstructions. The dataset scale and new baseline could serve as a foundation for future research on real-world spatial reasoning.

major comments (2)

Abstract: the assertion that 'extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions' provides no quantitative metrics, baseline comparisons, error analysis, or protocol details on how partial-visibility cases were constructed or measured, leaving the central empirical claim without verifiable support.
Evaluation setup (implied by abstract and skeptic note): the test distribution in SpatialMosaic-Bench is generated by the identical pipeline used for the 2M training pairs, creating a circularity risk where gains may exploit synthetic cues (e.g., depth noise patterns or QA phrasing) rather than transferable multi-view reasoning; no transfer results on real captures such as ScanNet or Matterport, nor human realism ratings, are reported to anchor the 'realistic' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: Abstract: the assertion that 'extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions' provides no quantitative metrics, baseline comparisons, error analysis, or protocol details on how partial-visibility cases were constructed or measured, leaving the central empirical claim without verifiable support.

Authors: We agree that the abstract, constrained by length, summarizes the empirical results at a high level without specific numbers or protocol details. The full manuscript provides these elements in Sections 4 and 5, including quantitative gains over baselines, error breakdowns, and the exact construction protocol for partial-visibility, occlusion, and low-overlap cases described in Section 3. To improve verifiability, we will revise the abstract to include key metrics (e.g., accuracy improvements on the six benchmark tasks) while preserving brevity. revision: yes
Referee: Evaluation setup (implied by abstract and skeptic note): the test distribution in SpatialMosaic-Bench is generated by the identical pipeline used for the 2M training pairs, creating a circularity risk where gains may exploit synthetic cues (e.g., depth noise patterns or QA phrasing) rather than transferable multi-view reasoning; no transfer results on real captures such as ScanNet or Matterport, nor human realism ratings, are reported to anchor the 'realistic' claim.

Authors: We acknowledge the valid concern about potential distribution overlap and synthetic cue exploitation. The benchmark uses disjoint scenes and deliberately varied generation parameters (visibility ratios, overlap thresholds, and noise levels) to reduce this risk, as detailed in Section 3.2. However, the current experiments do not include transfer evaluations on real captures such as ScanNet or Matterport, nor human realism ratings. We will add an explicit limitations subsection discussing this gap and outlining future real-world validation steps. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivation chain or self-referential reductions.

full rationale

The paper introduces a data generation pipeline to produce the SpatialMosaic dataset (2M QA pairs) and SpatialMosaic-Bench (1M QA pairs), then reports empirical gains from training SpatialMosaicVLM on the data. No equations, fitted parameters, uniqueness theorems, or ansatzes are defined in terms of the target results. Claims rest on the new artifacts and standard train/test splits rather than any step that reduces by construction to its own inputs. The shared generation pipeline between train and test data is a standard empirical setup for synthetic benchmarks and does not meet the criteria for circularity (no quoted self-definition or load-bearing self-citation that collapses the central claim).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the data-generation pipeline produces high-quality, realistic spatial reasoning examples; no numerical free parameters are fitted and no new physical entities are postulated.

axioms (1)

domain assumption The automated multi-view data generation and annotation pipeline produces realistic and challenging spatial reasoning QAs that match real-world partial-visibility conditions.
Invoked when describing the pipeline and when claiming experimental validation of enhanced spatial reasoning.

pith-pipeline@v0.9.0 · 5568 in / 1286 out tokens · 45954 ms · 2026-05-16T19:40:06.997331+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 5 internal anchors

[1]

Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes

Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020. 3

work page 2020
[2]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[3]

Neural module networks

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48, 2016. 3

work page 2016
[4]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 3

work page 2022
[5]

Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370,

Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, and Jiangmiao Pang. Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370,

work page arXiv
[6]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024. 7, 8, 1

work page 2024
[7]

Hsfm: Hybrid structure-from-motion

Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. 2017. 3

work page 2017
[8]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 5

work page 2017
[9]

Mm-spatial: Exploring 3d spatial understanding in multimodal llms

Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 7395–7408, 2025. 2

work page 2025
[10]

3d-llava: Towards generalist 3d lmms with omni superpoint transformer

Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Day- oub, and Ian Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3772–3782,

work page
[11]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025. 2, 3, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 3

work page arXiv 2024
[13]

Blink: Multimodal large language mod- els can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language mod- els can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024. 2

work page 2024
[14]

Multi-view stereo: A tutorial.Foundations and trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015

Yasutaka Furukawa, Carlos Hern ´andez, et al. Multi-view stereo: A tutorial.Foundations and trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015. 3

work page 2015
[15]

Cambridge university press, 2003

Richard Hartley and Andrew Zisserman.Multiple view geom- etry in computer vision. Cambridge university press, 2003. 3

work page 2003
[16]

3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023. 3

work page 2023
[17]

Multi- view transformer for 3d visual grounding

Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi- view transformer for 3d visual grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022. 3

work page 2022
[18]

Text2scene: Text-driven indoor scene stylization with part- aware details

Inwoo Hwang, Hyeonwoo Kim, and Young Min Kim. Text2scene: Text-driven indoor scene stylization with part- aware details. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 1890–1899,

work page
[19]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 2, 3

work page 2024
[20]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluat- ing multi-perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025. 2

work page arXiv 2025
[22]

Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

work page
[23]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2

work page 2024
[24]

V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion

Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anand- kumar. V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 9087–9098, 2023. 2

work page 2023
[25]

Vila: On pre-training for visual 9 language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual 9 language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689– 26699, 2024. 7, 8

work page 2024
[26]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2, 3

work page 2023
[27]

Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Sit- uated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 3

work page arXiv 2022
[28]

Nerf in the wild: Neural radiance fields for uncon- strained photo collections

Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duck- worth. Nerf in the wild: Neural radiance fields for uncon- strained photo collections. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021. 2

work page 2021
[29]

I2mvformer: Large language model generated multi-view document supervi- sion for zero-shot image classification

Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, and Federico Tombari. I2mvformer: Large language model generated multi-view document supervi- sion for zero-shot image classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15169–15179, 2023. 2

work page 2023
[30]

Global structure-from-motion revisited

Linfei Pan, D´aniel Bar´ath, Marc Pollefeys, and Johannes L Sch¨onberger. Global structure-from-motion revisited. InEuro- pean Conference on Computer Vision, pages 58–77. Springer,

work page
[31]

Structure- from-motion revisited

Johannes L Sch¨onberger and Jan-Michael Frahm. Structure- from-motion revisited. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 4104– 4113, 2016. 3

work page 2016
[32]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016. 3

work page 2016
[33]

Vipergpt: Vi- sual inference via python execution for reasoning

D´ıdac Sur´ıs, Sachit Menon, and Carl V ondrick. Vipergpt: Vi- sual inference via python execution for reasoning. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 11888–11898, 2023. 3

work page 2023
[34]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3, 6

work page 2025
[35]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510– 10522, 2025. 3

work page 2025
[36]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 2, 3

work page 2024
[37]

Multi-SpatialMLLM: Multi- frame spatial understanding with multi-modal large language models.arXiv preprint arXiv:2505.17015, 2025

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xi- aodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial un- derstanding with multi-modal large language models.arXiv preprint arXiv:2505.17015, 2025. 2, 3

work page arXiv 2025
[38]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 2, 3, 7

work page 2025
[39]

Seeing from another perspective: Evaluating multi-view understanding in mllms

Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma. Seeing from another perspective: Evaluating multi-view understanding in mllms. arXiv preprint arXiv:2504.15280, 2025. 2

work page arXiv 2025
[40]

inerf: Inverting neural radiance fields for pose estimation

Lin Yen-Chen, Pete Florence, Jonathan T Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021. 2

work page 2021
[41]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 4, 5, 6

work page 2023
[42]

Neural-symbolic vqa: Dis- entangling reasoning from vision and language understanding

Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Push- meet Kohli, and Josh Tenenbaum. Neural-symbolic vqa: Dis- entangling reasoning from vision and language understanding. Advances in neural information processing systems, 31, 2018. 3

work page 2018
[43]

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1910
[44]

From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976,

Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976, 2025. 2, 3

work page arXiv 2025
[45]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Llava- next: A strong zero-shot video understanding model, 2024

Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava- next: A strong zero-shot video understanding model, 2024. 6, 7, 8

work page 2024
[47]

Mmicl: Empowering vision-language model with multi-modal in-context learn- ing

Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. Mmicl: Empowering vision-language model with multi-modal in-context learning.arXiv preprint arXiv:2309.07915, 2023. 2

work page arXiv 2023
[48]

Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences

Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, and Chuang Gan. Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3761–3771, 2025. 3

work page 2025
[49]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language 10 understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Towards foundation models for 3d vision: How close are we? In2025 International Conference on 3D Vision (3DV), pages 1285–1296

Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, and Thomas L Griffiths. Towards foundation models for 3d vision: How close are we? In2025 International Conference on 3D Vision (3DV), pages 1285–1296. IEEE,

work page
[51]

Yes;(x t, yt)

2 11 SpatialMosaic: A Multiview VLM Dataset for Partial Visibility Supplementary Material A. Statistics of SpatialMosaic-Bench We provide detailed statistics ofSpatialMosaic-Benchacross different difficulty levels in Fig. 5. Our benchmark contains total 1M QA pairs distributed across six main task categories: Count, Best-View Selection, Existence, Attribu...

work page