Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs
Pith reviewed 2026-05-20 06:27 UTC · model grok-4.3
The pith
MLLMs achieve camera-robust 3D localization by writing the pinhole back-projection equation in Chain-of-Thought and substituting tool outputs directly into it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework re-purposes spatial tools to supply values for a pinhole back-projection equation that is written out explicitly inside Chain-of-Thought. Retrieved camera intrinsics and metric depths are substituted directly into the formula before the model regresses 9-DoF bounding boxes. When camera intrinsics are rescaled from 0.5x to 1.5x, the approach is reported to outperform RGB-only and loosely tool-augmented baselines on both 3D object detection and 3D visual grounding.
What carries the argument
The pinhole back-projection equation written out in full during Chain-of-Thought, with tool outputs for focal lengths, principal points, and depths substituted as exact variables to compute 3D points from 2D image locations.
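The substitution step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `back_project` and the example values are hypothetical, the vertical coordinate uses the standard pinhole analogue (the source only quotes the horizontal equation), and $\bar{Z}$ is assumed to be the mean of the sampled depths.

```python
from statistics import mean


def back_project(u, v, depths, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a camera-frame 3D point with the pinhole model."""
    if not depths:
        raise ValueError("need at least one depth sample")
    if fx == 0 or fy == 0:
        raise ValueError("focal lengths must be non-zero")
    z = mean(depths)       # Z-bar: ASSUMED to be the mean of the multi-point depth samples
    x = (u - cx) * z / fx  # X-hat = (u_c - c_x) * Z-bar / f_x, as quoted in the abstract
    y = (v - cy) * z / fy  # vertical analogue from the standard pinhole model (assumption)
    return x, y, z


# Hypothetical tool outputs: intrinsics in pixels, depths in metres.
point = back_project(u=400.0, v=300.0, depths=[1.9, 2.0, 2.1],
                     fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)  # approximately (0.32, 0.24, 2.0)
```

Doing this arithmetic in code rather than in free-form text makes every input traceable to a tool output, which is the property the equation-anchored framing relies on.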
If this is right
- Largest gains appear when camera scale deviates most from the training distribution.
- Camera information moves from optional hint to required input in the final 9-DoF prediction.
- The same framework applies to both 3D object detection and 3D visual grounding.
- Tool outputs function as precise formula variables rather than loose numerical cues.
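The first consequence can be made concrete with a short scaling argument. This is a sketch under a stated assumption: rescaling multiplies only the focal length $f_x$ by a factor $s$, while the pixel offset $u_c - c_x$ and the depth $\bar{Z}$ are held fixed. The source does not spell out the rescaling protocol, so the derivation is illustrative only.

```latex
\hat{X}(s) \;=\; \frac{(u_c - c_x)\,\bar{Z}}{s\,f_x} \;=\; \frac{1}{s}\,\hat{X}(1),
\qquad
\hat{X}(0.5) = 2\,\hat{X}(1), \qquad \hat{X}(1.5) = \tfrac{2}{3}\,\hat{X}(1).
```

Under that assumption, a model that implicitly applies the training focal length ($s = 1$) predicts $\hat{X}(1)$ where $\hat{X}(1)/s$ is correct, a relative lateral error of $|s - 1|$. The error reaches 50% at both $s = 0.5$ and $s = 1.5$ and vanishes at $s = 1$, which is consistent with gains being largest at the extremes.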
Where Pith is reading between the lines
- Explicit equation anchoring may transfer to other MLLM tasks that require geometric consistency, such as monocular depth or camera pose estimation.
- Forcing the model to perform the substitution step could reduce errors that arise from implicit pattern matching alone.
- The method suggests testing with full camera calibration models that include distortion parameters beyond simple focal-length rescaling.
Load-bearing premise
The MLLM will correctly read the written pinhole equation after tool substitution in Chain-of-Thought and use the substituted values to produce accurate 9-DoF bounding boxes.
What would settle it
A controlled test that removes the explicit equation and substitution step from the prompt while keeping tool outputs available as hints, then checks whether the performance advantage on rescaled-camera benchmarks disappears.
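One way to set up such a controlled test is to generate two prompt conditions that differ only in the equation step. The sketch below is hypothetical throughout: the function `build_prompt`, the prompt wording, and the dictionary keys are illustrative assumptions and are not taken from the paper.

```python
def build_prompt(tool, anchored):
    """Return a prompt; the two conditions differ only by the equation instruction."""
    # Shared lines: identical in both conditions (hypothetical wording).
    hint = (f"Tool outputs: u={tool['u']}, cx={tool['cx']}, "
            f"fx={tool['fx']}, depth={tool['z']} m.")
    task = "Predict the 9-DoF bounding box of the target object."
    if not anchored:
        # Hint-only condition: tool values are available but no formula is required.
        return "\n".join([hint, task])
    # Equation-anchored condition: the substitution step is made explicit.
    equation = ("First write X = (u - cx) * Z / fx, substitute the tool outputs, "
                "and compute X.")
    return "\n".join([hint, equation, task])


tool = {"u": 400.0, "cx": 320.0, "fx": 500.0, "z": 2.0}  # hypothetical values
hint_only = build_prompt(tool, anchored=False)
anchored = build_prompt(tool, anchored=True)
```

Keeping the tool values and task line byte-identical across conditions means any performance gap can be attributed to the equation step rather than to differences in available information.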
Original abstract
3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that an equation-anchored tool-use framework enables camera-robust 3D localization in MLLMs for 3D object detection and 3D visual grounding. By retrieving camera intrinsics and multi-point metric depths from external tools, explicitly writing the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ in Chain-of-Thought, substituting the tool outputs into the formula, and then regressing 9-DoF bounding boxes, the approach deterministically propagates camera information. Experiments under rescaled intrinsics (0.5× to 1.5×) show outperformance over RGB-only and tool-augmented baselines, with larger gains at scales farthest from the training distribution.
Significance. If the results hold and the mechanism is shown to be causal, the work could meaningfully advance robustness in MLLM-based 3D perception by moving beyond implicit cue interpretation to explicit geometric anchoring. This addresses a core limitation of camera intrinsic ambiguity and overfitting to canonical scales, offering a generalizable template for integrating equations into LLM reasoning for spatial tasks.
major comments (1)
- [Proposed Framework and equation substitution in CoT] The central claim that robustness gains arise from deterministic camera propagation rests on the MLLM accurately performing the arithmetic substitution of tool outputs into $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ within CoT and then using the resulting numerical 3D coordinates to constrain the final 9-DoF regression. The manuscript provides no verification, error analysis, or ablation demonstrating that the model executes these substitutions without calculation errors or that the substituted values causally influence the output boxes rather than incidental prompting effects. This is load-bearing for attributing the reported gains under 0.5×–1.5× rescaling specifically to the equation-anchored mechanism.
minor comments (1)
- [Abstract] The abstract states 'significant gains' without reporting quantitative deltas, error bars, or per-scale breakdowns; adding these would improve clarity of the experimental claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need to verify the causal mechanism behind our equation-anchored approach. We agree that direct evidence of accurate substitution and its influence on predictions would strengthen the attribution of robustness gains. We address the major comment below and commit to revisions that provide the requested verification and analysis.
Point-by-point responses
Referee: The central claim that robustness gains arise from deterministic camera propagation rests on the MLLM accurately performing the arithmetic substitution of tool outputs into $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ within CoT and then using the resulting numerical 3D coordinates to constrain the final 9-DoF regression. The manuscript provides no verification, error analysis, or ablation demonstrating that the model executes these substitutions without calculation errors or that the substituted values causally influence the output boxes rather than incidental prompting effects. This is load-bearing for attributing the reported gains under 0.5×–1.5× rescaling specifically to the equation-anchored mechanism.
Authors: We acknowledge that the current manuscript lacks explicit verification of the substitution process, error rates in arithmetic, or ablations isolating causality from prompting effects. To address this, the revised version will include: (1) a quantitative analysis of CoT traces from a representative subset of examples, reporting the percentage of cases where tool outputs are correctly substituted into the pinhole equation without arithmetic errors; (2) an error analysis tabulating substitution mistakes by the MLLM across different intrinsic scales; and (3) a controlled ablation that either omits the equation step, provides incorrect numerical values, or replaces the explicit formula with implicit prompting while keeping all other components fixed. These additions will directly test whether the deterministic propagation causally drives the observed gains, particularly at scales farthest from the training distribution (0.5× and 1.5×).
revision: yes
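The trace analysis promised in items (1) and (2) could look roughly like the following Python sketch. The trace format, the function names `check_trace` and `summarize`, the dictionary keys, and the tolerances are all hypothetical assumptions made for illustration; the source does not describe how CoT traces are formatted or scored.

```python
import math
import re
from collections import Counter

NUM = r"(-?\d+(?:\.\d+)?)"
# ASSUMED trace format, e.g. "X = (400 - 320) * 2.0 / 500 = 0.32"
PATTERN = re.compile(
    rf"X\s*=\s*\(\s*{NUM}\s*-\s*{NUM}\s*\)\s*\*\s*{NUM}\s*/\s*{NUM}\s*=\s*{NUM}"
)


def check_trace(trace, tool, tol=1e-2):
    """Classify one CoT trace against the tool outputs it should have used."""
    match = PATTERN.search(trace)
    if match is None:
        return "missing_equation"
    u, cx, z, fx, stated = (float(g) for g in match.groups())
    # Were the tool outputs copied into the formula unchanged?
    copied = all(
        math.isclose(got, tool[key], abs_tol=1e-6)
        for got, key in ((u, "u"), (cx, "cx"), (z, "z"), (fx, "fx"))
    )
    if not copied or fx == 0:
        return "wrong_substitution"
    # Recompute the right-hand side and compare with the value the model wrote.
    expected = (u - cx) * z / fx
    return "ok" if abs(expected - stated) <= tol else "arithmetic_error"


def summarize(traces, tool):
    """Count outcome categories over a list of traces sharing the same tool outputs."""
    return Counter(check_trace(t, tool) for t in traces)
```

Separating "wrong_substitution" from "arithmetic_error" distinguishes a model that ignores the tool outputs from one that uses them but miscalculates, which are different failure modes for the causal claim.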
Circularity Check
No circularity: explicit geometric substitution is independent of outputs
The paper's core proposal is to insert the standard pinhole back-projection equation explicitly into CoT, retrieve external tool values for intrinsics and depths, substitute them, and then regress the 9-DoF box. This construction does not redefine any quantity in terms of the final prediction, nor does it fit parameters on a subset and relabel the result as a prediction. No self-citation is invoked to establish uniqueness or to smuggle an ansatz; the equation is the conventional camera model. Empirical gains under rescaled intrinsics are presented as measured outcomes rather than derived by algebraic identity from the inputs. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "equation-anchored tool-use framework that re-purposes spatial tools as formula variables"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.