pith. sign in

arxiv: 2605.22558 · v1 · pith:NUZUNLT4new · submitted 2026-05-21 · 💻 cs.CV

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords geometric groundingvisual tokensspatio-temporal reasoningvision-language modelstoken-adaptive allocationspatial intelligencemultimodal reasoning
0
0 comments X

The pith

Geometry should ground visual tokens as a prerequisite before language models perform scene reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language models need geometric information built into their visual representations from the start rather than added during reasoning. Different visual tokens play different spatial roles and therefore need tailored geometric evidence. GeoWeaver achieves this by pulling relevant abstractions from a multi-level geometry bank created by a frozen encoder and incorporating them into each token through residual grounding before language modeling begins. This early integration leads to better performance on spatial reasoning benchmarks while keeping general multimodal abilities intact. The approach treats geometry not as an optional extra but as the foundation for accurate spatio-temporal understanding.

Core claim

GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning.

What carries the argument

token-adaptive geometric evidence allocation from a multi-level geometry bank, which assigns specific geometric abstractions to individual visual tokens based on their spatial roles before reasoning occurs.

Load-bearing premise

Different visual tokens require distinct geometric evidence based on their spatial roles, and a frozen geometry encoder plus token-adaptive allocation can supply the most relevant abstractions without degrading semantic content or downstream performance.

What would settle it

A direct comparison showing that late-fusion of the same geometric information achieves equal or better results on spatial reasoning benchmarks than the pre-reasoning grounding would falsify the prerequisite claim.

Figures

Figures reproduced from arXiv: 2605.22558 by Deshui Miao, Haijun Zhang, Ming-Hsuan Yang, Xingsen Huang, Xin Li, Yameng Gu.

Figure 1
Figure 1. Figure 1: Motivation and paradigm comparison of GeoWeaver. Top: Existing geometry-enhanced VLMs mainly introduce geometry through pre-fusion or LLM-side fusion, while GeoWeaver grounds visual tokens before language reasoning. Bottom: VGGT feature maps from different layers exhibit heterogeneous spatial responses, indicating that multi-layer geometry does not provide a uniform signal. This motivates our design of tre… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of GeoWeaver. GeoWeaver treats geometry as a representational prerequisite rather than a late fusion signal. A frozen VGGT encoder provides a multi-layer geometry bank, from which each visual token adaptively retrieves sparse geometric evidence via query-conditioned compact geometric grounding before entering the Qwen LLM. This pre-reasoning grounding process converts semantic visual tokens into g… view at source ↗
read the original abstract

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GeoWeaver, a pre-reasoning geometric grounding framework for vision-language models. It builds a multi-level geometry bank from a frozen geometry encoder, performs token-adaptive allocation of geometric evidence to individual visual tokens based on their spatial roles, and applies residual grounding to incorporate this evidence into the visual representations before language modeling. The central claim is that treating geometry as a representational prerequisite (rather than a late-fusion auxiliary signal or shared cue) improves spatio-temporal reasoning on benchmarks while preserving general multimodal capabilities.

Significance. If the empirical gains are shown to stem specifically from the prerequisite-style grounding rather than auxiliary fusion, the work would offer a modular, parameter-efficient route to inject geometric structure into VLMs. The token-adaptive mechanism and code release are practical strengths that could influence downstream applications in robotics and scene understanding.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (method overview): the claim that geometry must act as a 'fundamental prerequisite that shapes the representational foundation' is load-bearing for the paper's novelty, yet the manuscript provides no explicit comparison or ablation against late-fusion baselines that apply the same geometry bank after the LLM stage; without this contrast, the superiority of pre-reasoning residual grounding over auxiliary signals remains unverified.
  2. [§4.2] §4.2 (token-adaptive allocation): the frozen geometry encoder is used without a described projection or alignment module to map its output features into the VLM visual token embedding space; if the selected geometric abstractions are misaligned, the residual update cannot function as a true representational prerequisite and may instead add noise, directly threatening the central claim.
  3. [§5] §5 (experiments): the reported improvements on spatial reasoning benchmarks are presented without error bars, statistical significance tests, or per-token ablation showing that allocation selects spatially relevant evidence rather than generic features; this weakens the assertion that distinct geometric evidence per token is necessary.
minor comments (2)
  1. [§3] Notation for the multi-level geometry bank and residual grounding operation should be introduced with explicit equations in §3 to improve reproducibility.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from clearer labeling of the allocation and residual steps to distinguish them from standard cross-attention.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of our central claims and experimental rigor. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method overview): the claim that geometry must act as a 'fundamental prerequisite that shapes the representational foundation' is load-bearing for the paper's novelty, yet the manuscript provides no explicit comparison or ablation against late-fusion baselines that apply the same geometry bank after the LLM stage; without this contrast, the superiority of pre-reasoning residual grounding over auxiliary signals remains unverified.

    Authors: We agree that a direct ablation against a late-fusion baseline using the identical geometry bank would provide stronger verification of the prerequisite-style advantage. Our current experiments compare against methods that incorporate geometry at various stages, but do not isolate a post-LLM fusion variant. In the revised manuscript we will add this baseline (applying the same multi-level bank via residual fusion after the LLM) and report the resulting performance gap on the spatial reasoning benchmarks. revision: yes

  2. Referee: [§4.2] §4.2 (token-adaptive allocation): the frozen geometry encoder is used without a described projection or alignment module to map its output features into the VLM visual token embedding space; if the selected geometric abstractions are misaligned, the residual update cannot function as a true representational prerequisite and may instead add noise, directly threatening the central claim.

    Authors: The manuscript describes the geometry bank construction but does not explicitly detail the feature alignment step. We will revise §4.2 to include a learned linear projection layer that maps the frozen encoder outputs into the VLM visual token space prior to token-adaptive allocation. This module is lightweight, frozen-encoder compatible, and ensures dimensional and distributional alignment so that the residual grounding operates on commensurate representations. revision: yes

  3. Referee: [§5] §5 (experiments): the reported improvements on spatial reasoning benchmarks are presented without error bars, statistical significance tests, or per-token ablation showing that allocation selects spatially relevant evidence rather than generic features; this weakens the assertion that distinct geometric evidence per token is necessary.

    Authors: We acknowledge the value of these statistical and ablation details. In the revision we will (i) report mean and standard deviation over three random seeds for all main results, (ii) include paired t-test p-values for the key comparisons, and (iii) add a per-token ablation that measures the spatial relevance of allocated evidence (e.g., via overlap with ground-truth object regions) versus random or generic feature selection. These additions will directly support the necessity of token-adaptive allocation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method uses external frozen encoder with empirical validation

full rationale

The paper introduces GeoWeaver as a pre-reasoning grounding framework that constructs a multi-level geometry bank from a frozen external geometry encoder, performs token-adaptive allocation, and applies residual grounding to visual tokens before language modeling. The central claim—that geometry functions as a representational prerequisite rather than late-fusion auxiliary—is presented as an empirical outcome from benchmark evaluations, not as a mathematical derivation or prediction that reduces to author-defined inputs by construction. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are evident in the described chain. The approach relies on an independent frozen encoder and design choices validated externally, rendering the derivation self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the method assumes a pre-trained frozen geometry encoder supplies useful abstractions that can be selectively allocated.

pith-pipeline@v0.9.0 · 5788 in / 978 out tokens · 42403 ms · 2026-05-22T07:20:23.388363+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 18 internal anchors

  1. [1]

    Llava-onevision-1.5: Fully open framework for democratized multimodal training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal training. InarXiv, 2025. 16

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  4. [4]

    Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 3

  5. [5]

    Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

    Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142...

  6. [6]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 3

  7. [7]

    Qwen3-vl: Multimodal large language model series

    QwenLM Team (Alibaba Cloud). Qwen3-vl: Multimodal large language model series. https://github. com/QwenLM/Qwen3-VL, 2025. GitHub repository; accessed: 2025-11-14. 6, 7, 16

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 16

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 3, 6

  10. [10]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 346–355,

  11. [11]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025. 2, 3, 6, 7

  12. [12]

    Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

  13. [13]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025. 7

  14. [14]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024. 3, 6

  15. [15]

    Gemini 3 Pro Model Card

    Gemini. Gemini 3 Pro Model Card. Technical report, Gemini, November 2025. Accessed: 2025-11-18. 3, 6

  16. [16]

    Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025. 3

  17. [17]

    Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world, 2026

    Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, and Zhi Wang. Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world, 2026. 16

  18. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, Adam J. Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 16

  19. [19]

    Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding

    Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3600–3610, 2025. 3

  20. [20]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024. 3

  21. [21]

    Zebra-cot: A dataset for interleaved vision language reasoning.arXiv preprint arXiv:2507.16746, 2025

    Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-cot: A dataset for interleaved vision language reasoning.arXiv preprint arXiv:2507.16746, 2025. 3

  22. [22]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought.arXiv preprint arXiv:2501.07542,

  23. [23]

    ViewSpatial- Bench: Evaluating multi-perspective spatial under- standing of vision-language models.arXiv preprint arXiv:2505.21500, 2025

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025. 6

  24. [24]

    Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

    Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, and Xiaodan Liang. Thinking with geometry: Active geometry integration for spatial reasoning.arXiv preprint arXiv:2602.06037, 2026. 3

  25. [25]

    Spatialladder: Progressive training for spatial reasoning in vision-language models.arXiv preprint arXiv:2510.08531, 2025

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models.arXiv preprint arXiv:2510.08531, 2025. 2, 3, 6, 7

  26. [26]

    Spatialladder: Progressive training for spatial reasoning in vision-language models, 2025

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models, 2025. 16

  27. [27]

    Enhancing action and ingredient modeling for semantically grounded recipe generation.arXiv preprint arXiv:2602.15862,

    Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang. Enhancing action and ingredient modeling for semantically grounded recipe generation.arXiv preprint arXiv:2602.15862,

  28. [28]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 3

  29. [29]

    Ssr: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning

    Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, and Donglin Wang. Ssr: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning. arXiv preprint arXiv:2505.12448, 2025. 3 11

  30. [30]

    Unipixel: Unified object referring and segmentation for pixel-level visual reasoning.arXiv preprint arXiv:2509.18094,

    Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. Unipixel: Unified object referring and segmentation for pixel-level visual reasoning.arXiv preprint arXiv:2509.18094,

  31. [31]

    Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024. 7

  32. [32]

    Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024. 7

  33. [34]

    Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

    Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025. 16

  34. [35]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025. 6

  35. [36]

    Videoglamm: A large multimodal model for pixel-level visual grounding in videos.arXiv preprint arXiv:2411.04923, 2024

    Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos.arXiv preprint arXiv:2411.04923, 2024. 16

  36. [37]

    Gpt-5.https://openai.com/gpt-5/, 2025

    OpenAI. Gpt-5.https://openai.com/gpt-5/, 2025. Accessed: 2025-11-11. 16

  37. [38]

    GPT-5 System Card

    OpenAI. GPT-5 System Card. Technical report, OpenAI, August 2025. Accessed: 2025-08-10. 3, 6, 15

  38. [40]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025. 16

  39. [41]

    Seed1.5-VL Technical Report

    ByteDance Seed Team. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 6

  40. [42]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 3

  41. [43]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,, pages 5294–5306, 2025. 3, 4

  42. [44]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025. 3

  43. [45]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 3

  44. [48]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 16

  45. [49]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025. 2, 3, 6, 7 12

  46. [50]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025. 6, 7

  47. [51]

    Mind’s eye of llms: visualization-of-thought elicits spatial reasoning in large language models.Advances in Neural Information Processing Systems, 37:90277–90317, 2024

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: visualization-of-thought elicits spatial reasoning in large language models.Advances in Neural Information Processing Systems, 37:90277–90317, 2024. 3

  48. [52]

    Grok 4, 7 2025

    xAI. Grok 4, 7 2025. Model announcement. 6

  49. [53]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 3, 6

  50. [55]

    Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning.arXiv preprint arXiv:2511.05491,

  51. [56]

    Cambrian-S: Towards Spatial Supersensing in Video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video.arXiv preprint arXiv:2511.04670, 2025. 2, 3, 6, 7

  52. [57]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshop, 2025. 3, 6, 7

  53. [58]

    Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos.arXiv preprint arXiv:2501.04001, 2025. 16

  54. [59]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976, 2025

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976, 2025. 3, 6, 15

  55. [60]

    Spatialstack: Layered geometry- language fusion for 3d vlm spatial reasoning

    Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, and Zhiwen Fan. Spatialstack: Layered geometry- language fusion for 3d vlm spatial reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. 6, 8

  56. [61]

    ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

    Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, and Angel X Chang. Revsi: Rebuilding visual spatial intelligence evaluation for accurate assessment of vlm 3d reasoning.arXiv preprint arXiv:2604.24300, 2026. 6, 7

  57. [62]

    Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025. 2, 3

  58. [63]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025. 3

  59. [64]

    Vlm4d: Towards spatiotemporal awareness in vision language models

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 8600–8612, 2025. 3

  60. [65]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025. 3

  61. [66]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 6 13 We provide additional details about the training and inference, as well as more experiments on...