
arxiv: 2606.05833 · v2 · pith:JT2EVYTQnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Pith reviewed 2026-06-28 02:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords geometric representations · multimodal large language models · spatial intelligence · video-based learning · camera pose estimation · depth map regression · feature distillation · 3D awareness

The pith

Distilling four geometric targets from videos restructures MLLM latent space to create spatial intelligence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GeoVR as a way to give multimodal large language models intrinsic 3D awareness using only 2D video sequences. It distills knowledge of inter-frame camera poses, dense depth maps, metric scale factors, and multi-scale 3D features from pre-trained 3D models into the MLLM through a multi-objective strategy. This is intended to reshape the model's internal semantic representations rather than add superficial features. A sympathetic reader would care because existing MLLMs often fail to maintain geometric consistency across video frames, which restricts their reliability in tasks needing spatial understanding.

Core claim

GeoVR learns geometric representations using purely 2D video sequences by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: estimating inter-frame camera poses to embed varying viewpoint dynamics, regressing dense depth maps to anchor physical distances, predicting a metric scale factor for real-world calibration, and distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness and achieve state-of-the-art perform

What carries the argument

The multi-objective learning strategy driven by four complementary geometric targets distilled from pre-trained 3D models into the MLLM.
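
The multi-objective strategy can be pictured as one scalar training loss that adds the four geometric terms to the usual next-token loss. The sketch below is a minimal illustration, not the authors' implementation: the loss names follow the Figure 2 caption (text, camera, depth, scale, align), while the plain weighted-sum form, the `LossWeights` container, and the default weight of 1.0 are assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: this summary does not report the paper's values.
    cam: float = 1.0
    depth: float = 1.0
    scale: float = 1.0
    align: float = 1.0


def total_loss(l_text, l_cam, l_depth, l_scale, l_align, w=LossWeights()):
    """Next-token loss plus a weighted sum of the four geometric losses."""
    return (
        l_text
        + w.cam * l_cam
        + w.depth * l_depth
        + w.scale * l_scale
        + w.align * l_align
    )
```

Under this reading, setting a weight to zero removes that target from training, which is one simple way to run the per-target ablations the referee report asks for.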

If this is right

  • The model's internal representations naturally develop strong 3D awareness.
  • GeoVR achieves state-of-the-art performance on spatial reasoning benchmarks.
  • The approach establishes a new paradigm for endowing foundation models with spatial intelligence.
  • It functions effectively even when large-scale 3D data is scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same targets could be used to add spatial consistency to video generation or editing models.
  • The method may transfer to single-image inputs if the distilled features generalize beyond video sequences.
  • Applications in robotics or AR could exploit the learned metric scale factor for calibrated outputs.
  • Testing on held-out video domains would reveal whether the 3D awareness is robust to new camera motions.

Load-bearing premise

That distilling the four geometric targets from pre-trained 3D models into an MLLM via video will restructure its semantic latent space to produce genuine spatial intelligence rather than superficial feature alignment.

What would settle it

A spatial reasoning benchmark where the trained model shows no gain in cross-frame geometric consistency or 3D feature correlation compared with the base MLLM.
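
One possible way to quantify the "3D feature correlation" in this test is linear centered kernel alignment (CKA) between MLLM features and the 3D teacher's features. The paper summary does not name a metric, so CKA is a substitute chosen here for illustration; the function name and the (samples, dim) array layout are likewise assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (samples, dim).

    Returns a value in [0, 1]; it is invariant to orthogonal transforms
    and isotropic scaling of either feature space.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

A check in this spirit would compute `linear_cka(base_feats, teacher_feats)` and `linear_cka(trained_feats, teacher_feats)` on the same frames; no gain for the trained model would count against the restructuring claim.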

Figures

Figures reproduced from arXiv: 2606.05833 by Haibo Wang, Lifu Huang.

Figure 1
Comparison of different paradigms. P and V denote point clouds and RGB video. E_P, E_2D, and E_3D denote point cloud, 2D vision, and 3D foundation encoders, respectively. (a) relies on scarce 3D data, limiting scalability. (b) patches external 3D features onto 2D tokens, causing inference overhead. (c) (ours) restructures the latent space via training-only geometric constraints.
Figure 2
Framework of GeoVR. During training, alongside the standard next-token prediction (L_text), the MLLM representation is additionally restructured via: camera pose estimation (L_cam), depth prediction (L_depth), metric scale calibration (L_scale), and geometric representation alignment (L_align) from a frozen 3D teacher (E_3D). All auxiliary heads and the E_3D branch are discarded during inference.
Figure 3
Cross-view correspondences and PCA projections of …
Figure 4
Distill the geometric prior from F_3D into F_2D.
Adjacent text from the paper: … dimension of F_2D to D_3D. The geometric representation alignment loss L_align is then optimized by minimizing the cosine distance between the projected MLLM features and the teacher’s geometric features:
$$\mathcal{L}_{\text{align}} = \frac{1}{|L|} \sum_{l \in L} \mathrm{Sim}\left(F^{l}_{3D},\ \phi\left(F^{s(l)}_{2D}\right)\right) \qquad (5)$$
where L defines the set of E_3D’s layer indices chosen for distillation, and s(l) denotes the correspo…
Figure 5
PCA projections of visual representations.
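
The alignment loss quoted alongside Figure 4 can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' code: it reads Sim as cosine distance (1 minus cosine similarity), following the text's "minimizing the cosine distance"; it stands in for the projector ϕ with a single linear map `proj`; and the names `align_loss`, `layer_map`, and the dictionary layout are hypothetical.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; a and b are (tokens, dim)."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))


def align_loss(f3d, f2d, proj, layer_map):
    """Average cosine distance over the teacher layers chosen for distillation.

    f3d:       dict, teacher layer index l -> features of shape (tokens, D3D)
    f2d:       dict, MLLM layer index s(l) -> features of shape (tokens, D2D)
    proj:      (D2D, D3D) matrix, an assumed linear stand-in for phi
    layer_map: dict, l -> s(l)
    """
    losses = [
        cosine_distance(f3d[l], f2d[s] @ proj) for l, s in layer_map.items()
    ]
    return sum(losses) / len(losses)
```

Because the distance depends only on direction, the loss is zero whenever the projected MLLM features are a positive multiple of the teacher features, so this term by itself does not constrain feature magnitude.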
Original abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GeoVR, a framework that distills geometric knowledge into MLLMs from 2D video sequences using four targets derived from pre-trained 3D foundation models: inter-frame camera pose estimation, dense depth map regression, metric scale factor prediction, and multi-scale 3D feature distillation. The central claim is that this multi-objective strategy restructures the MLLM's semantic latent space to produce intrinsic 3D awareness and spatial intelligence, yielding state-of-the-art results on spatial reasoning benchmarks.

Significance. If the four-target distillation can be shown to induce genuine internal restructuring for 3D awareness independent of the teacher models, the work would offer a practical route to spatial capabilities in MLLMs without requiring large-scale 3D data. The explicit use of complementary geometric constraints (pose, depth, scale, features) from video is a coherent design choice that merits further investigation.

major comments (3)
  1. [Abstract] Abstract: The claim that the approach 'restructures the semantic latent space within MLLMs to unlock spatial intelligence' and achieves SOTA performance is asserted without any reported baselines, ablation results, quantitative metrics, or representation diagnostics, making it impossible to assess whether the data support the central mechanism.
  2. [Method (four-target distillation)] Method description of the four-target strategy: No representation-level evidence (e.g., CCA between layers, linear probes on frozen backbone features, or controlled ablations isolating backbone changes from head training) is provided to demonstrate that the distillation produces internal 3D awareness rather than output-head alignment with the pre-trained teachers; this distinction is load-bearing for the claim of 'genuine' restructuring versus superficial imitation.
  3. [Experiments] Experiments section: The dependence on pre-trained 3D foundation models for all four targets is not accompanied by any analysis or controls showing that the resulting spatial reasoning is independent of those models' own training data and objectives, which directly affects whether the performance gains constitute a new derivation.
minor comments (1)
  1. [Abstract and Method] The abstract and method sections would benefit from explicit notation for the loss weights on the four geometric objectives to clarify the multi-objective optimization.
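
The linear-probe diagnostic named in major comment 2 could look roughly like the sketch below: freeze the backbone, extract features, and fit only a linear map to a geometric target such as per-patch depth. The closed-form ridge solver, the function names, and the use of R² as the score are choices made here for illustration; the paper does not report this protocol.

```python
import numpy as np


def _with_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])


def fit_linear_probe(feats, targets, l2=1e-3):
    """Closed-form ridge regression from frozen features to a target."""
    x = _with_bias(feats)
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_r2(weights, feats, targets):
    """Coefficient of determination of the probe's predictions."""
    pred = _with_bias(feats) @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Fitting the same probe on features from the base MLLM and from the GeoVR-trained backbone, and scoring both on held-out frames, would separate backbone changes from what the discarded prediction heads learned.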

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger evidence supporting our claims of internal representation restructuring. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the approach 'restructures the semantic latent space within MLLMs to unlock spatial intelligence' and achieves SOTA performance is asserted without any reported baselines, ablation results, quantitative metrics, or representation diagnostics, making it impossible to assess whether the data support the central mechanism.

    Authors: We agree the abstract would be strengthened by explicit grounding. In the revision we will update the abstract to reference the specific SOTA margins, ablation results, and quantitative metrics reported in the experiments section. revision: yes

  2. Referee: [Method (four-target distillation)] Method description of the four-target strategy: No representation-level evidence (e.g., CCA between layers, linear probes on frozen backbone features, or controlled ablations isolating backbone changes from head training) is provided to demonstrate that the distillation produces internal 3D awareness rather than output-head alignment with the pre-trained teachers; this distinction is load-bearing for the claim of 'genuine' restructuring versus superficial imitation.

    Authors: The referee is correct that the manuscript currently relies on downstream task performance rather than direct representation diagnostics. We will add linear probe experiments on frozen backbone features and controlled ablations that isolate backbone changes from the prediction heads to provide this evidence in the revised method and experiments sections. revision: yes

  3. Referee: [Experiments] Experiments section: The dependence on pre-trained 3D foundation models for all four targets is not accompanied by any analysis or controls showing that the resulting spatial reasoning is independent of those models' own training data and objectives, which directly affects whether the performance gains constitute a new derivation.

    Authors: We acknowledge the absence of such controls. While complete independence from the teachers' training data is difficult to demonstrate without their original datasets, we will add a discussion of this point together with comparisons against direct teacher-model outputs and a note on the complementary nature of the four targets in the revised experiments section. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation relies on external pre-trained models and empirical benchmarks

full rationale

The paper presents a distillation framework using four geometric targets from pre-trained 3D foundation models applied to video inputs for an MLLM. No equations, self-citations, or fitted parameters are shown reducing the claimed restructuring of latent space or SOTA benchmark results to the inputs by construction. The method is externally falsifiable via spatial reasoning benchmarks, and the pre-trained models constitute independent external support rather than a self-referential chain. This is a standard non-circular empirical approach.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on the untested premise that the four geometric targets will embed metric 3D structure inside an MLLM's existing latent space. It also treats pre-trained 3D foundation models as reliable oracles whose internal representations transfer without distortion. No free parameters are named, but the relative weighting of the four loss terms is necessarily chosen by hand.

free parameters (1)
  • loss weights for the four geometric objectives
    The multi-objective strategy requires balancing terms for pose, depth, scale, and feature distillation; these scalars are not derived and must be selected to make training stable.
axioms (2)
  • domain assumption: Distilling geometry from pre-trained 3D models via video will restructure MLLM representations to produce intrinsic 3D awareness rather than task-specific feature copying.
    Invoked in the description of how the four targets 'naturally develop strong 3D awareness' inside the model.
  • domain assumption: 2D video sequences contain sufficient viewpoint and depth signal to support metric-scale geometric learning when supervised by external 3D models.
    Stated as the premise that allows purely 2D input to unlock spatial intelligence.

pith-pipeline@v0.9.1-grok · 5749 in / 1670 out tokens · 31610 ms · 2026-06-28T02:11:30.018543+00:00 · methodology

