GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3
The pith
Geometry should ground visual tokens as a prerequisite before language models perform scene reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning.
What carries the argument
Token-adaptive geometric evidence allocation from a multi-level geometry bank, which assigns specific geometric abstractions to individual visual tokens based on their spatial roles before reasoning occurs.
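A minimal sketch of what token-adaptive allocation followed by residual grounding could look like, assuming soft attention over bank entries that already share the visual-token dimension. The function name `ground_tokens`, the dot-product scoring, the scaling factor, and the `alpha` weight are illustrative assumptions; this summary does not specify the paper's actual allocation rule or grounding operator.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ground_tokens(visual_tokens, geometry_bank, alpha=1.0):
    """Add a token-specific mixture of bank entries to each visual token.

    visual_tokens: list of d-dimensional vectors (lists of floats).
    geometry_bank: list of d-dimensional vectors, assumed to be pooled from
        several encoder levels and already projected into the token space.
    alpha: hypothetical scale on the residual update.
    Returns the grounded tokens and the per-token allocation weights.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    grounded, allocations = [], []
    for token in visual_tokens:
        # Each token scores every bank entry, so allocation differs per token.
        weights = softmax([scale * dot(token, entry) for entry in geometry_bank])
        evidence = [
            sum(w * entry[k] for w, entry in zip(weights, geometry_bank))
            for k in range(dim)
        ]
        # Residual grounding: the original token is kept and evidence is added.
        grounded.append([token[k] + alpha * evidence[k] for k in range(dim)])
        allocations.append(weights)
    return grounded, allocations


if __name__ == "__main__":
    # Toy values chosen only to make the allocation visible.
    tokens = [[4.0, 0.0], [0.0, 4.0]]
    bank = [[1.0, 0.0], [0.0, 1.0]]
    grounded, allocations = ground_tokens(tokens, bank)
    print(allocations)
    print(grounded)
```

Setting `alpha=0.0` returns the ungrounded tokens unchanged, which gives a simple control for checking whether any gain comes from the added evidence.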
Load-bearing premise
Different visual tokens require distinct geometric evidence based on their spatial roles, and a frozen geometry encoder plus token-adaptive allocation can supply the most relevant abstractions without degrading semantic content or downstream performance.
What would settle it
A direct comparison showing that late fusion of the same geometric information achieves equal or better results on spatial reasoning benchmarks than pre-reasoning grounding would falsify the prerequisite claim.
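The two conditions in that comparison can be written side by side. The notation below is hypothetical and not taken from the paper: v_i is visual token i, g_i is the geometric evidence allocated to it, q is the text query, G is the geometry bank, and \phi stands for whatever late-fusion head the baseline would use.

```latex
% Hypothetical notation for illustration only (requires amsmath).
\begin{align*}
\text{pre-reasoning grounding:}\quad
  & \tilde{v}_i = v_i + g_i,
  \qquad y = \mathrm{LLM}(\tilde{v}_1, \dots, \tilde{v}_N,\, q) \\
\text{late fusion:}\quad
  & h = \mathrm{LLM}(v_1, \dots, v_N,\, q),
  \qquad y = \phi(h,\, G)
\end{align*}
```

In the first form the language model only ever sees grounded tokens; in the second it reasons over ungrounded tokens and geometry enters afterwards. Holding G fixed across both is what makes the comparison a test of placement rather than of the geometric signal itself.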
Original abstract
Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoWeaver, a pre-reasoning geometric grounding framework for vision-language models. It builds a multi-level geometry bank from a frozen geometry encoder, performs token-adaptive allocation of geometric evidence to individual visual tokens based on their spatial roles, and applies residual grounding to incorporate this evidence into the visual representations before language modeling. The central claim is that treating geometry as a representational prerequisite (rather than a late-fusion auxiliary signal or shared cue) improves spatio-temporal reasoning on benchmarks while preserving general multimodal capabilities.
Significance. If the empirical gains are shown to stem specifically from the prerequisite-style grounding rather than auxiliary fusion, the work would offer a modular, parameter-efficient route to inject geometric structure into VLMs. The token-adaptive mechanism and code release are practical strengths that could influence downstream applications in robotics and scene understanding.
major comments (3)
- [Abstract and §3] Abstract and §3 (method overview): the claim that geometry must act as a 'fundamental prerequisite that shapes the representational foundation' is load-bearing for the paper's novelty, yet the manuscript provides no explicit comparison or ablation against late-fusion baselines that apply the same geometry bank after the LLM stage; without this contrast, the superiority of pre-reasoning residual grounding over auxiliary signals remains unverified.
- [§4.2] §4.2 (token-adaptive allocation): the frozen geometry encoder is used without a described projection or alignment module to map its output features into the VLM visual token embedding space; if the selected geometric abstractions are misaligned, the residual update cannot function as a true representational prerequisite and may instead add noise, directly threatening the central claim.
- [§5] §5 (experiments): the reported improvements on spatial reasoning benchmarks are presented without error bars, statistical significance tests, or per-token ablation showing that allocation selects spatially relevant evidence rather than generic features; this weakens the assertion that distinct geometric evidence per token is necessary.
minor comments (2)
- [§3] Notation for the multi-level geometry bank and residual grounding operation should be introduced with explicit equations in §3 to improve reproducibility.
- [Figure 2] Figure 2 (architecture diagram) would benefit from clearer labeling of the allocation and residual steps to distinguish them from standard cross-attention.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of our central claims and experimental rigor. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
- Referee: [Abstract and §3] Abstract and §3 (method overview): the claim that geometry must act as a 'fundamental prerequisite that shapes the representational foundation' is load-bearing for the paper's novelty, yet the manuscript provides no explicit comparison or ablation against late-fusion baselines that apply the same geometry bank after the LLM stage; without this contrast, the superiority of pre-reasoning residual grounding over auxiliary signals remains unverified.
  Authors: We agree that a direct ablation against a late-fusion baseline using the identical geometry bank would provide stronger verification of the prerequisite-style advantage. Our current experiments compare against methods that incorporate geometry at various stages, but do not isolate a post-LLM fusion variant. In the revised manuscript we will add this baseline (applying the same multi-level bank via residual fusion after the LLM) and report the resulting performance gap on the spatial reasoning benchmarks.
  Revision: yes
- Referee: [§4.2] §4.2 (token-adaptive allocation): the frozen geometry encoder is used without a described projection or alignment module to map its output features into the VLM visual token embedding space; if the selected geometric abstractions are misaligned, the residual update cannot function as a true representational prerequisite and may instead add noise, directly threatening the central claim.
  Authors: The manuscript describes the geometry bank construction but does not explicitly detail the feature alignment step. We will revise §4.2 to include a learned linear projection layer that maps the frozen encoder outputs into the VLM visual token space prior to token-adaptive allocation. This module is lightweight, frozen-encoder compatible, and ensures dimensional and distributional alignment so that the residual grounding operates on commensurate representations.
  Revision: yes
- Referee: [§5] §5 (experiments): the reported improvements on spatial reasoning benchmarks are presented without error bars, statistical significance tests, or per-token ablation showing that allocation selects spatially relevant evidence rather than generic features; this weakens the assertion that distinct geometric evidence per token is necessary.
  Authors: We acknowledge the value of these statistical and ablation details. In the revision we will (i) report mean and standard deviation over three random seeds for all main results, (ii) include paired t-test p-values for the key comparisons, and (iii) add a per-token ablation that measures the spatial relevance of allocated evidence (e.g., via overlap with ground-truth object regions) versus random or generic feature selection. These additions will directly support the necessity of token-adaptive allocation.
  Revision: yes
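The seed-level reporting promised in the last response could be computed along these lines. This is a sketch under assumptions: scores are paired by random seed, the numbers are made-up placeholders rather than results from the paper, and `summarize` and `paired_t_statistic` are hypothetical helper names. Converting the statistic to a p-value is left to a statistics library.

```python
import math
import statistics


def summarize(scores):
    """Return the mean and sample standard deviation over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(baseline, method):
    """Return the paired t statistic and degrees of freedom.

    Differences are taken as method minus baseline, pair by pair.
    """
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [m - b for b, m in zip(baseline, method)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("identical paired differences; t is undefined")
    t_value = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder accuracies for three seeds; not taken from the paper.
    baseline_scores = [60.0, 61.0, 62.0]
    method_scores = [62.0, 64.0, 66.0]
    print(summarize(baseline_scores))
    print(summarize(method_scores))
    print(paired_t_statistic(baseline_scores, method_scores))
```

With only three seeds the test has two degrees of freedom, so even sizeable gaps can fail to reach conventional significance thresholds; reporting the per-seed differences alongside the p-value keeps that limitation visible.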
Circularity Check
No significant circularity; method uses external frozen encoder with empirical validation
The paper introduces GeoWeaver as a pre-reasoning grounding framework that constructs a multi-level geometry bank from a frozen external geometry encoder, performs token-adaptive allocation, and applies residual grounding to visual tokens before language modeling. The central claim—that geometry functions as a representational prerequisite rather than late-fusion auxiliary—is presented as an empirical outcome from benchmark evaluations, not as a mathematical derivation or prediction that reduces to author-defined inputs by construction. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are evident in the described chain. The approach relies on an independent frozen encoder and design choices validated externally, rendering the derivation self-contained without circular reduction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.