Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

Deli Zhao; Gongjie Zhang; Quanhao Qian; Ran Xu; Shijian Lu; Wenhao Li; Xueying Jiang

arxiv: 2605.19528 · v1 · pith:BYVI35DDnew · submitted 2026-05-19 · 💻 cs.CV

Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu

Pith reviewed 2026-05-20 06:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 3D localization, MLLM, camera intrinsics, tool use, chain-of-thought, pinhole camera, 3D object detection, visual grounding

The pith

MLLMs achieve camera-robust 3D localization by writing the pinhole back-projection equation in Chain-of-Thought and substituting tool outputs directly into it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles camera intrinsic ambiguity in MLLM 3D localization, where identical images can arise from different 3D scenes depending on the camera. Prior methods either drop camera parameters or pull depth and intrinsics from tools but leave the model free to interpret those numbers loosely. The new framework retrieves intrinsics and multi-point metric depths, states the back-projection equation explicitly during reasoning, and inserts the tool values straight into the formula before producing 9-DoF boxes. Experiments on 3D object detection and visual grounding show clear gains under camera rescaling from 0.5x to 1.5x, largest where the scale differs most from training data. Readers care because real images arrive from cameras whose parameters are rarely known in advance, so deterministic use of that information matters for reliable 3D output.
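The ambiguity can be made concrete with a small numeric sketch. It is not taken from the paper: it assumes that rescaling multiplies the focal length by a factor `scale` while the pixel offset from the principal point and the metric depth stay fixed, and the values of `F_TRAIN`, `PIXEL_OFFSET`, and `DEPTH` are made up for illustration.

```python
# Illustrative only: all numeric values are hypothetical, not from the paper.
F_TRAIN = 1000.0      # assumed canonical training focal length, in pixels
PIXEL_OFFSET = 200.0  # u_c - c_x, in pixels
DEPTH = 4.0           # metric depth Z, in metres


def lateral_x(pixel_offset: float, depth: float, fx: float) -> float:
    """Pinhole back-projection for the X coordinate: X = (u_c - c_x) * Z / f_x."""
    return pixel_offset * depth / fx


def relative_error_if_intrinsics_ignored(scale: float) -> float:
    """Relative X error when the training focal length replaces the true one."""
    true_x = lateral_x(PIXEL_OFFSET, DEPTH, F_TRAIN * scale)
    assumed_x = lateral_x(PIXEL_OFFSET, DEPTH, F_TRAIN)
    return (assumed_x - true_x) / true_x


for scale in (0.5, 0.75, 1.0, 1.25, 1.5):
    err = relative_error_if_intrinsics_ignored(scale)
    print(f"scale {scale:.2f}: relative X error {err:+.0%}")
```

Under these assumptions the relative error works out to `scale - 1`, so it grows towards both ends of the 0.5x to 1.5x range and vanishes at the training scale.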

Core claim

By re-purposing spatial tools to supply values for an explicitly written pinhole back-projection equation inside Chain-of-Thought, the framework substitutes retrieved camera intrinsics and metric depths directly into the formula before regressing 9-DoF bounding boxes, producing better results than RGB-only or loosely tool-augmented baselines on both 3D object detection and 3D visual grounding when camera intrinsics are rescaled from 0.5x to 1.5x.

What carries the argument

The pinhole back-projection equation written out in full during Chain-of-Thought, with tool outputs for focal lengths, principal points, and depths substituted as exact variables to compute 3D points from 2D image locations.
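As a rough sketch of the computation the model is asked to write out, the function below back-projects one pixel into camera coordinates. Only the X formula appears in the source; the Y line is the standard pinhole analogue and is an addition here. The function name, the `Point3D` type, and the numeric tool outputs are hypothetical.

```python
from typing import NamedTuple


class Point3D(NamedTuple):
    x: float
    y: float
    z: float


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) at metric depth `depth` into camera coordinates."""
    if fx == 0 or fy == 0:
        raise ValueError("focal lengths must be non-zero")
    x = (u - cx) * depth / fx  # X = (u - c_x) * Z / f_x, as quoted in the abstract
    y = (v - cy) * depth / fy  # standard pinhole analogue for Y (not quoted in the source)
    return Point3D(x, y, depth)


# Hypothetical tool outputs: intrinsics from a camera tool, depth from a depth tool.
point = back_project(u=840.0, v=300.0, depth=4.0,
                     fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(point)
```

Every input is an explicit argument, so a change in the retrieved intrinsics necessarily changes the computed point, which is the property the paper wants the reasoning trace to have.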

If this is right

Largest gains appear when camera scale deviates most from the training distribution.
Camera information moves from optional hint to required input in the final 9-DoF prediction.
The same framework applies to both 3D object detection and 3D visual grounding.
Tool outputs function as precise formula variables rather than loose numerical cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Explicit equation anchoring may transfer to other MLLM tasks that require geometric consistency, such as monocular depth or camera pose estimation.
Forcing the model to perform the substitution step could reduce errors that arise from implicit pattern matching alone.
The method suggests testing with full camera calibration models that include distortion parameters beyond simple focal-length rescaling.

Load-bearing premise

The MLLM will correctly read the written pinhole equation after tool substitution in Chain-of-Thought and use the substituted values to produce accurate 9-DoF bounding boxes.

What would settle it

A controlled test that removes the explicit equation and substitution step from the prompt while keeping tool outputs available as hints, then checks whether the performance advantage on rescaled-camera benchmarks disappears.

Figures

Figures reproduced from arXiv: 2605.19528 by Deli Zhao, Gongjie Zhang, Quanhao Qian, Ran Xu, Shijian Lu, Wenhao Li, Xueying Jiang.

**Figure 2.** Overview of our proposed equation-anchored spatial agent. Given a single-frame …

**Figure 3.** Qualitative comparison on 3D visual grounding (top) and 3D object detection (bottom).

Original abstract

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
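The abstract quotes only the X component of the back-projection. For reference, it follows from the standard pinhole projection by solving for $\hat{X}$. The second line is the usual analogue for the vertical axis and is an addition here, with $v_c$, $c_y$, $f_y$, and $\hat{Y}$ introduced by analogy rather than taken from the abstract. Reading $u_c$ as the horizontal pixel coordinate of the object centre and $\bar{Z}$ as an aggregate of the sampled depths is an interpretation of the notation.

```latex
\begin{aligned}
u_c = f_x\,\frac{\hat{X}}{\bar{Z}} + c_x
  \quad &\Longrightarrow \quad
  \hat{X} = \frac{(u_c - c_x)\,\bar{Z}}{f_x}, \\
v_c = f_y\,\frac{\hat{Y}}{\bar{Z}} + c_y
  \quad &\Longrightarrow \quad
  \hat{Y} = \frac{(v_c - c_y)\,\bar{Z}}{f_y}.
\end{aligned}
```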

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The core idea of writing the pinhole equation explicitly in CoT and substituting tool outputs is a direct attempt to fix camera ambiguity in MLLM 3D tasks, but it rests on shaky ground because current models often fail at the required arithmetic steps.

The paper's main move is to treat retrieved intrinsics and depths as variables in an explicit back-projection formula written out in chain-of-thought, then substitute the numbers before the final 9-DoF regression. This is presented as a way to make predictions actually depend on the camera parameters rather than treating them as loose hints or ignoring them. From the abstract that seems to be the clearest difference from prior tool-augmented baselines, and the reported gains on rescaled intrinsics (especially at 0.5x and 1.5x) line up with the claim that the method helps most when the camera deviates from training conditions. That part is worth noting because camera robustness is a real practical issue for 3D localization in the wild.

The approach also keeps the method low on circularity by relying on external tool outputs and a standard geometric equation instead of fitting extra parameters inside the model. The experiments focus on both detection and grounding, which is reasonable coverage for the claim.

That said, the central mechanism is fragile. The stress-test concern is on point: MLLMs frequently make arithmetic errors even when a formula is shown, and nothing in the setup appears to force the model to actually use the substituted coordinate when it outputs the box. If the gains come mostly from the extra prompting structure rather than deterministic propagation of the camera values, the robustness story weakens. The abstract does not include error bars, exact baseline implementations, or checks that the final predictions are consistent with the substituted equation, so it is hard to tell how much of the improvement is real versus incidental. The assumption that the model will correctly follow the written equation after substitution is the weakest link and needs direct evidence.

This work is aimed at researchers building MLLM systems for 3D vision who already use tool retrieval. A reader looking for concrete ways to inject geometry into language-model reasoning might pick up the equation-anchoring pattern. It is coherent enough on its own terms to deserve a serious referee who can check the full experimental details and any verification that the substituted values actually constrain the output. I would send it to review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The paper claims that an equation-anchored tool-use framework enables camera-robust 3D localization in MLLMs for 3D object detection and 3D visual grounding. The approach retrieves camera intrinsics and multi-point metric depths from external tools, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought, substitutes the tool outputs into the formula, and then regresses 9-DoF bounding boxes, with the aim of propagating camera information deterministically. Experiments under rescaled intrinsics (0.5× to 1.5×) show that the method outperforms RGB-only and tool-augmented baselines, with larger gains at scales farthest from the training distribution.

Significance. If the results hold and the mechanism is shown to be causal, the work could meaningfully advance robustness in MLLM-based 3D perception by moving beyond implicit cue interpretation to explicit geometric anchoring. This addresses a core limitation of camera intrinsic ambiguity and overfitting to canonical scales, offering a generalizable template for integrating equations into LLM reasoning for spatial tasks.

major comments (1)

[Proposed Framework and equation substitution in CoT] The central claim that robustness gains arise from deterministic camera propagation rests on the MLLM accurately performing the arithmetic substitution of tool outputs into $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ within CoT and then using the resulting numerical 3D coordinates to constrain the final 9-DoF regression. The manuscript provides no verification, error analysis, or ablation demonstrating that the model executes these substitutions without calculation errors, or that the substituted values causally influence the output boxes rather than the gains arising from incidental prompting effects. This is load-bearing for attributing the reported gains under 0.5×–1.5× rescaling specifically to the equation-anchored mechanism.
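One way the arithmetic half of this concern could be checked is to recompute the equation from the values the model wrote down and compare against the result it stated. The sketch below is hypothetical: the `SubstitutionStep` record, its field names, and the tolerances are assumptions, and nothing here reflects how the authors parse or evaluate their traces.

```python
import math
from dataclasses import dataclass


@dataclass
class SubstitutionStep:
    """Hypothetical record parsed from one CoT trace; field names are assumptions."""
    u_c: float       # pixel column the model substituted
    c_x: float       # principal point returned by the intrinsics tool
    z_bar: float     # metric depth returned by the depth tool
    f_x: float       # focal length returned by the intrinsics tool
    stated_x: float  # X value the model wrote after substituting


def recompute_x(step: SubstitutionStep) -> float:
    """Evaluate X = (u_c - c_x) * Z / f_x from the substituted values."""
    return (step.u_c - step.c_x) * step.z_bar / step.f_x


def arithmetic_is_correct(step: SubstitutionStep,
                          rel_tol: float = 1e-2,
                          abs_tol: float = 1e-3) -> bool:
    """True when the model's stated X matches the recomputed X within tolerance."""
    return math.isclose(step.stated_x, recompute_x(step),
                        rel_tol=rel_tol, abs_tol=abs_tol)


# Made-up traces: one with correct arithmetic, one with a factor-of-two slip.
good_step = SubstitutionStep(u_c=840.0, c_x=640.0, z_bar=4.0, f_x=1000.0, stated_x=0.80)
bad_step = SubstitutionStep(u_c=840.0, c_x=640.0, z_bar=4.0, f_x=1000.0, stated_x=1.60)
print(arithmetic_is_correct(good_step), arithmetic_is_correct(bad_step))
```

A check of this kind only covers whether the substitution was computed correctly. Whether the computed coordinate actually constrains the final box is a separate question that needs an ablation.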

minor comments (1)

[Abstract] The abstract states 'significant gains' without reporting quantitative deltas, error bars, or per-scale breakdowns; adding these would improve clarity of the experimental claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to verify the causal mechanism behind our equation-anchored approach. We agree that direct evidence of accurate substitution and its influence on predictions would strengthen the attribution of robustness gains. We address the major comment below and commit to revisions that provide the requested verification and analysis.

Referee: The central claim that robustness gains arise from deterministic camera propagation rests on the MLLM accurately performing the arithmetic substitution of tool outputs into $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ within CoT and then using the resulting numerical 3D coordinates to constrain the final 9-DoF regression. The manuscript provides no verification, error analysis, or ablation demonstrating that the model executes these substitutions without calculation errors, or that the substituted values causally influence the output boxes rather than the gains arising from incidental prompting effects. This is load-bearing for attributing the reported gains under 0.5×–1.5× rescaling specifically to the equation-anchored mechanism.

Authors: We acknowledge that the current manuscript lacks explicit verification of the substitution process, error rates in arithmetic, or ablations isolating causality from prompting effects. To address this, the revised version will include: (1) a quantitative analysis of CoT traces from a representative subset of examples, reporting the percentage of cases where tool outputs are correctly substituted into the pinhole equation without arithmetic errors; (2) an error analysis tabulating substitution mistakes by the MLLM across different intrinsic scales; and (3) a controlled ablation that either omits the equation step, provides incorrect numerical values, or replaces the explicit formula with implicit prompting while keeping all other components fixed. These additions will directly test whether the deterministic propagation causally drives the observed gains, particularly at scales farthest from the training distribution (0.5× and 1.5×).

revision: yes
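The per-scale tabulation promised in item (2) could take a shape like the following. This is a sketch under stated assumptions, not the authors' analysis: the function name, the `(scale, is_correct)` input format, and the toy outcomes are all hypothetical, and how a trace is judged correct is left open.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def substitution_error_rate_by_scale(
    results: Iterable[Tuple[float, bool]],
) -> Dict[float, float]:
    """Map each intrinsic scale to the fraction of traces with an incorrect substitution.

    `results` holds (scale, is_correct) pairs, one per inspected CoT trace.
    """
    totals: Dict[float, int] = defaultdict(int)
    errors: Dict[float, int] = defaultdict(int)
    for scale, is_correct in results:
        totals[scale] += 1
        if not is_correct:
            errors[scale] += 1
    return {scale: errors[scale] / totals[scale] for scale in sorted(totals)}


# Made-up outcomes purely to exercise the function; not results from the paper.
toy_results = [
    (0.5, True), (0.5, False),
    (1.0, True), (1.0, True),
    (1.5, False), (1.5, True), (1.5, True), (1.5, True),
]
rates = substitution_error_rate_by_scale(toy_results)
for scale, rate in rates.items():
    print(f"scale {scale}: substitution error rate {rate:.0%}")
```

Reporting the rate per scale, rather than one pooled number, would show whether substitution mistakes concentrate at the scales where the largest gains are claimed.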

Circularity Check

0 steps flagged

No circularity: explicit geometric substitution is independent of outputs

full rationale

The paper's core proposal is to insert the standard pinhole back-projection equation explicitly into CoT, retrieve external tool values for intrinsics and depths, substitute them, and then regress the 9-DoF box. This construction does not redefine any quantity in terms of the final prediction, nor does it fit parameters on a subset and relabel the result as a prediction. No self-citation is invoked to establish uniqueness or to smuggle an ansatz; the equation is the conventional camera model. Empirical gains under rescaled intrinsics are presented as measured outcomes rather than derived by algebraic identity from the inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the standard pinhole camera model and the assumption that tool outputs can be directly substituted into explicit equations within CoT; no free parameters, ad-hoc axioms, or new invented entities are described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: equation-anchored tool-use framework that re-purposes spatial tools as formula variables

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 12 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

work page 2022
[2]

Claude Opus 4.7

Anthropic. Claude Opus 4.7. https://www.anthropic.com/claude/opus, 2026. Accessed: 2026-05- 06

work page 2026
[3]

Claude Sonnet 4.6

Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Accessed: 2026-05-06

work page 2026
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

work page arXiv 2025
[6]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Scanrefer: 3d object localization in rgb-d scans using natural language

Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InProceedings of the IEEE/CVF European Conference on Computer Vision, pages 202–221. Springer, 2020

work page 2020
[8]

Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26428–26438, 2024

work page 2024
[9]

Spatialrgpt: Grounded spatial reasoning in vision-language models.Advances in Neural Information Processing Systems, 37:135062–135093, 2024

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models.Advances in Neural Information Processing Systems, 37:135062–135093, 2024

work page 2024
[10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 5828–5839, 2017

work page 2017
[12]

DeepSeek-V4.https://www.deepseek.com/, 2026

DeepSeek-AI. DeepSeek-V4.https://www.deepseek.com/, 2026. Accessed: 2026-05-06

work page 2026
[13]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

A survey of large language model-powered spatial intelli- gence across scales: Advances in embodied agents, smart cities, and earth science.arXiv preprint arXiv:2504.09848, 2025

Jie Feng, Jinwei Zeng, Qingyue Long, Hongyi Chen, Jie Zhao, Yanxin Xi, Zhilun Zhou, Yuan Yuan, Shengyuan Wang, Qingbin Zeng, et al. A survey of large language model-powered spatial intelli- gence across scales: Advances in embodied agents, smart cities, and earth science.arXiv preprint arXiv:2504.09848, 2025

work page arXiv 2025
[15]

Refocus: Visual editing as a chain of thought for structured image understanding

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding. International Conference on Machine Learning, 2025

work page 2025
[16]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. InInternational Conference on Machine Learning, pages 10764–10799. PMLR, 2023

work page 2023
[17]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 10

work page 2024
[18]

Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

work page 2024
[19]

Vision-r1: Incentivizing reasoning capability in multimodal large language models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. International Conference on Learning Representations, 2026

work page 2026
[20]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Neuro-symbolic data generation for math reasoning.Advances in Neural Information Processing Systems, 37:23488–23515, 2024

Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, and Xiaoxing Ma. Neuro-symbolic data generation for math reasoning.Advances in Neural Information Processing Systems, 37:23488–23515, 2024

work page 2024
[22]

Visual instruction tuning.Advances in Neural Information Processing Systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in Neural Information Processing Systems, 36:34892–34916, 2023

work page 2023
[23]

Ssr: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning

Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, and Donglin Wang. Ssr: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning. Advances in Neural Information Processing Systems, 2025

work page 2025
[24]

Kimi K2.6.https://www.kimi.com/blog/kimi-k2-6, 2026

Moonshot AI. Kimi K2.6.https://www.kimi.com/blog/kimi-k2-6, 2026. Accessed: 2026-05-06

work page 2026
[25]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

OpenAI. GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026. Accessed: 2026- 05-06

work page 2026
[27]

Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning

Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. InConference on Empirical Methods in Natural Language Processing, pages 3806–3824, 2023

work page 2023
[28]

Unidepthv2: Universal monocular metric depth estimation made simpler.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[29]

Qwen3.6-Flash

Qwen Team. Qwen3.6-Flash. https://qwen.ai/blog?id=qwen3.6-35b-a3b, 2026. Accessed: 2026- 05-06

work page 2026
[30]

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning.arXiv preprint arXiv:2505.23678, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Seed-2.0-Pro

Seed Team. Seed-2.0-Pro. https://research.doubao.com/en/seed2, 2026. Accessed: 2026-05-06

work page 2026
[32]

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

work page 2023
[33]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888– 11898, 2023

work page 2023
[36]

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

Shi-Yu Tian, Zhi Zhou, Kun-Yang Yu, Ming Yang, Yang Chen, Ziqiao Shang, Lan-Zhe Guo, and Yu-Feng Li. Last: Leveraging tools as hints to enhance spatial reasoning for multimodal large language models. arXiv preprint arXiv:2604.09712, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.Advances in Neural Information Processing Systems, 2025

Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.Advances in Neural Information Processing Systems, 2025

work page 2025
[38]

Last: Learning to think in space and time for generalist vision-language models.arXiv preprint arXiv:2511.19261, 2025

Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo, and Jiaheng Wei. Last: Learning to think in space and time for generalist vision-language models.arXiv preprint arXiv:2511.19261, 2025

work page arXiv 2025
[39]

Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023

Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023

work page arXiv 2023
[40]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in Neural Information Processing Systems, 2025

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in Neural Information Processing Systems, 2025

work page 2025
[42]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10632–10643, 2025

work page 2025
[44]

arXiv:2509.18905 (2025) 6, 9, 17

Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, et al. How far are vlms from visual spatial intelligence? a benchmark-driven perspective.arXiv preprint arXiv:2509.18905, 2025

work page arXiv 2025
[45]

On the generalization capacities of mllms for spatial intelligence.International Conference on Learning Representations, 2026

Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, and Ran Xu. On the generalization capacities of mllms for spatial intelligence.International Conference on Learning Representations, 2026

work page 2026
[46]

From flatland to space: Teaching vision-language models to perceive and reason in 3d.Advances in Neural Information Processing Systems, 2025

Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d.Advances in Neural Information Processing Systems, 2025

work page 2025
[47]

Thyme: Think beyond images.International Conference on Learning Representations, 2026

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images.International Conference on Learning Representations, 2026

work page 2026
[48]

Think3d: Thinking with space for spatial reasoning.arXiv preprint arXiv:2601.13029, 2026

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, et al. Think3d: Thinking with space for spatial reasoning.arXiv preprint arXiv:2601.13029, 2026

work page arXiv 2026
[49]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.Advances in Neural Information Processing Systems, 2025

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.Advances in Neural Information Processing Systems, 2025

[50] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8995–9006, 2025

[51] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. International Conference on Learning Representations, 2026

[52] Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. Advances in Neural Information Processing Systems, 2025

[53] Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual perception with tools. arXiv preprint arXiv:2509.01656, 2025

