arxiv: 2605.19528 · v1 · pith:BYVI35DDnew · submitted 2026-05-19 · 💻 cs.CV

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

Pith reviewed 2026-05-20 06:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D localization · MLLM · camera intrinsics · tool use · chain-of-thought · pinhole camera · 3D object detection · visual grounding

The pith

MLLMs achieve camera-robust 3D localization by writing the pinhole back-projection equation in Chain-of-Thought and substituting tool outputs directly into it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles camera intrinsic ambiguity in MLLM 3D localization, where identical images can arise from different 3D scenes depending on the camera. Prior methods either drop camera parameters or pull depth and intrinsics from tools but leave the model free to interpret those numbers loosely. The new framework retrieves intrinsics and multi-point metric depths, states the back-projection equation explicitly during reasoning, and inserts the tool values straight into the formula before producing 9-DoF boxes. Experiments on 3D object detection and visual grounding show clear gains under camera rescaling from 0.5x to 1.5x, largest where the scale differs most from training data. Readers care because real images arrive from cameras whose parameters are rarely known in advance, so deterministic use of that information matters for reliable 3D output.

Core claim

By re-purposing spatial tools to supply values for an explicitly written pinhole back-projection equation inside Chain-of-Thought, the framework substitutes retrieved camera intrinsics and metric depths directly into the formula before regressing 9-DoF bounding boxes, producing better results than RGB-only or loosely tool-augmented baselines on both 3D object detection and 3D visual grounding when camera intrinsics are rescaled from 0.5x to 1.5x.

What carries the argument

The pinhole back-projection equation written out in full during Chain-of-Thought, with tool outputs for focal lengths, principal points, and depths substituted as exact variables to compute 3D points from 2D image locations.
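As a minimal sketch of the computation this equation performs, the snippet below back-projects one pixel at a known metric depth. The `Intrinsics` class, the `back_project` name, and the numeric values are illustrative assumptions, not taken from the paper; the Y line is the standard pinhole analogue of the X equation the abstract quotes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (hypothetical container, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(u: float, v: float, z: float, k: Intrinsics) -> tuple[float, float, float]:
    """Map pixel (u, v) at metric depth z to a camera-frame 3D point."""
    if z <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * z / k.fx  # X equation as quoted in the abstract
    y = (v - k.cy) * z / k.fy  # standard Y analogue (assumed here)
    return (x, y, z)


# Illustrative values only: a 640x480-style camera and a pixel 100 px right of centre.
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point = back_project(u=420.0, v=240.0, z=2.0, k=k)
print(point)  # (0.4, 0.0, 2.0)
```

Every quantity on the right-hand side comes from a tool or from the image, which is why the substitution step leaves no room for the model to reinterpret the numbers.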

If this is right

  • Largest gains appear when camera scale deviates most from the training distribution.
  • Camera information moves from optional hint to required input in the final 9-DoF prediction.
  • The same framework applies to both 3D object detection and 3D visual grounding.
  • Tool outputs function as precise formula variables rather than loose numerical cues.
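The first point in the list above has a simple geometric reading, sketched here under stated assumptions. The sketch assumes that "rescaling" multiplies the focal length and principal point by a factor `s` (as happens when an image is resized), and that a model ignoring camera parameters effectively applies the training intrinsics. All names and values are hypothetical, and the output illustrates geometry only, not the paper's measured results.

```python
def back_project_x(u: float, cx: float, fx: float, z: float) -> float:
    """X coordinate from the pinhole back-projection equation."""
    return (u - cx) * z / fx


def rescale(fx: float, cx: float, s: float) -> tuple:
    """Assumed rescaling model: intrinsics scale linearly with s."""
    return fx * s, cx * s


TRAIN_FX, TRAIN_CX = 500.0, 320.0  # hypothetical canonical training camera
X_TRUE, Z_TRUE = 0.4, 2.0          # hypothetical ground-truth point (metres)


def x_errors(scale: float) -> tuple:
    """Return (error with true intrinsics, error with training intrinsics)."""
    fx, cx = rescale(TRAIN_FX, TRAIN_CX, scale)
    u = fx * X_TRUE / Z_TRUE + cx  # forward projection under the true camera
    x_with_true = back_project_x(u, cx, fx, Z_TRUE)
    x_with_train = back_project_x(u, TRAIN_CX, TRAIN_FX, Z_TRUE)
    return abs(x_with_true - X_TRUE), abs(x_with_train - X_TRUE)


for s in (0.5, 0.75, 1.0, 1.25, 1.5):
    err_true, err_train = x_errors(s)
    print(f"scale {s:.2f}: true-intrinsics error {err_true:.3f} m, "
          f"training-intrinsics error {err_train:.3f} m")
```

In this toy setting the error from using the wrong intrinsics grows linearly with the distance of `s` from 1, while substituting the true intrinsics recovers the point at every scale.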

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit equation anchoring may transfer to other MLLM tasks that require geometric consistency, such as monocular depth or camera pose estimation.
  • Forcing the model to perform the substitution step could reduce errors that arise from implicit pattern matching alone.
  • The method suggests testing with full camera calibration models that include distortion parameters beyond simple focal-length rescaling.

Load-bearing premise

The MLLM will correctly read the written pinhole equation after tool substitution in Chain-of-Thought and use the substituted values to produce accurate 9-DoF bounding boxes.

What would settle it

A controlled test that removes the explicit equation and substitution step from the prompt while keeping tool outputs available as hints, then checks whether the performance advantage on rescaled-camera benchmarks disappears.

Figures

Figures reproduced from arXiv: 2605.19528 by Deli Zhao, Gongjie Zhang, Quanhao Qian, Ran Xu, Shijian Lu, Wenhao Li, Xueying Jiang.

Figure 1: Comparison of spatial reasoning paradigms in Multimodal Large Language Models
Figure 2: Overview of our proposed equation-anchored spatial agent. Given a single-frame …
Figure 3: Qualitative comparison on 3D visual grounding (top) and 3D object detection (bottom).
Original abstract

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
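The equation quoted in the abstract is the inverse of the standard pinhole projection, as the short derivation below shows. Reading $u_c$ as the horizontal pixel coordinate of the object centre and $\bar{Z}$ as an aggregate of the sampled depths is an assumption based on the notation; the vertical equation with $v_c$, $c_y$, $f_y$ is the standard analogue and is not quoted from the paper.

```latex
\begin{align}
u_c &= f_x \frac{X}{Z} + c_x
  && \text{(pinhole projection of a point at depth } Z\text{)} \\
\hat{X} &= \frac{(u_c - c_x)\,\bar{Z}}{f_x}
  && \text{(solve for } X \text{ and substitute the tool depth } \bar{Z}\text{)} \\
\hat{Y} &= \frac{(v_c - c_y)\,\bar{Z}}{f_y}
  && \text{(assumed vertical analogue)}
\end{align}
```

Because $f_x$ and $c_x$ appear explicitly, a change of camera changes $\hat{X}$ for the same pixel and depth, which is the ambiguity the framework is meant to resolve.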

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that an equation-anchored tool-use framework enables camera-robust 3D localization in MLLMs for 3D object detection and 3D visual grounding. The approach retrieves camera intrinsics and multi-point metric depths from external tools, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought, substitutes the tool outputs into the formula, and then regresses 9-DoF bounding boxes, so that camera information is propagated deterministically into the prediction. Experiments under rescaled intrinsics (0.5× to 1.5×) report gains over RGB-only and tool-augmented baselines, largest at the scales farthest from the training distribution.

Significance. If the results hold and the mechanism is shown to be causal, the work could meaningfully advance robustness in MLLM-based 3D perception by moving beyond implicit cue interpretation to explicit geometric anchoring. This addresses a core limitation of camera intrinsic ambiguity and overfitting to canonical scales, offering a generalizable template for integrating equations into LLM reasoning for spatial tasks.

major comments (1)
  1. [Proposed Framework and equation substitution in CoT] The central claim that robustness gains arise from deterministic camera propagation rests on the MLLM accurately performing the arithmetic substitution of tool outputs into $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ within CoT and then using the resulting numerical 3D coordinates to constrain the final 9-DoF regression. The manuscript provides no verification, error analysis, or ablation demonstrating that the model executes these substitutions without calculation errors or that the substituted values causally influence the output boxes rather than incidental prompting effects. This is load-bearing for attributing the reported gains under 0.5×–1.5× rescaling specifically to the equation-anchored mechanism.
minor comments (1)
  1. [Abstract] The abstract states 'significant gains' without reporting quantitative deltas, error bars, or per-scale breakdowns; adding these would improve clarity of the experimental claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to verify the causal mechanism behind our equation-anchored approach. We agree that direct evidence of accurate substitution and its influence on predictions would strengthen the attribution of robustness gains. We address the major comment below and commit to revisions that provide the requested verification and analysis.

  1. Referee: The central claim that robustness gains arise from deterministic camera propagation rests on the MLLM accurately performing the arithmetic substitution of tool outputs into $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ within CoT and then using the resulting numerical 3D coordinates to constrain the final 9-DoF regression. The manuscript provides no verification, error analysis, or ablation demonstrating that the model executes these substitutions without calculation errors or that the substituted values causally influence the output boxes rather than incidental prompting effects. This is load-bearing for attributing the reported gains under 0.5×–1.5× rescaling specifically to the equation-anchored mechanism.

    Authors: We acknowledge that the current manuscript lacks explicit verification of the substitution process, error rates in arithmetic, or ablations isolating causality from prompting effects. To address this, the revised version will include: (1) a quantitative analysis of CoT traces from a representative subset of examples, reporting the percentage of cases where tool outputs are correctly substituted into the pinhole equation without arithmetic errors; (2) an error analysis tabulating substitution mistakes by the MLLM across different intrinsic scales; and (3) a controlled ablation that either omits the equation step, provides incorrect numerical values, or replaces the explicit formula with implicit prompting while keeping all other components fixed. These additions will directly test whether the deterministic propagation causally drives the observed gains, particularly at scales farthest from the training distribution (0.5× and 1.5×).

    Revision: yes
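A trace-level check of the kind proposed in item (1) could look like the sketch below. It assumes a hypothetical trace format in which the model writes the substituted equation as `X = (u - cx) * Z / fx = value`; the paper's actual Chain-of-Thought format is not shown on this page, so the regular expression, the `check_substitution` name, and the tolerance are all illustrative assumptions.

```python
import math
import re

NUMBER = r"[-+]?\d+(?:\.\d+)?"
# Hypothetical trace format: "X = (u - cx) * Z / fx = value"
PATTERN = re.compile(
    rf"X\s*=\s*\(\s*({NUMBER})\s*-\s*({NUMBER})\s*\)"
    rf"\s*\*\s*({NUMBER})\s*/\s*({NUMBER})\s*=\s*({NUMBER})"
)


def check_substitution(trace: str, tool: dict, rel_tol: float = 1e-2) -> dict:
    """Check that a trace copies the tool values and does the arithmetic correctly."""
    m = PATTERN.search(trace)
    if m is None:
        return {"found": False, "values_match": False, "arithmetic_ok": False}
    u, cx, z, fx, stated = map(float, m.groups())
    # Did the model copy the tool outputs into the formula unchanged?
    values_match = (
        math.isclose(cx, tool["cx"])
        and math.isclose(z, tool["z"])
        and math.isclose(fx, tool["fx"])
    )
    # Is the stated result consistent with the numbers the model wrote down?
    expected = (u - cx) * z / fx
    arithmetic_ok = abs(stated - expected) <= rel_tol * max(1.0, abs(expected))
    return {"found": True, "values_match": values_match, "arithmetic_ok": arithmetic_ok}


tool = {"cx": 320.0, "z": 2.0, "fx": 500.0}  # illustrative tool outputs
good = check_substitution("X = (420 - 320) * 2.0 / 500 = 0.4", tool)
bad_math = check_substitution("X = (420 - 320) * 2.0 / 500 = 0.8", tool)
wrong_fx = check_substitution("X = (420 - 320) * 2.0 / 250 = 0.8", tool)
print(good, bad_math, wrong_fx)
```

Separating "copied the wrong value" from "computed the wrong result" matters here, because the two failure modes point to different weaknesses in the mechanism the referee asks about.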

Circularity Check

0 steps flagged

No circularity: explicit geometric substitution is independent of outputs

The paper's core proposal is to insert the standard pinhole back-projection equation explicitly into CoT, retrieve external tool values for intrinsics and depths, substitute them, and then regress the 9-DoF box. This construction does not redefine any quantity in terms of the final prediction, nor does it fit parameters on a subset and relabel the result as a prediction. No self-citation is invoked to establish uniqueness or to smuggle an ansatz; the equation is the conventional camera model. Empirical gains under rescaled intrinsics are presented as measured outcomes rather than derived by algebraic identity from the inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the standard pinhole camera model and the assumption that tool outputs can be directly substituted into explicit equations within CoT; no free parameters, ad-hoc axioms, or new invented entities are described.



