MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Recognition: 2 theorem links
Pith reviewed 2026-05-10 17:20 UTC · model grok-4.3
The pith
A multi-agent framework lets off-the-shelf vision-language models perform grounded reasoning in 3D scenes without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAG-3D is a training-free multi-agent framework built on off-the-shelf VLMs. It coordinates a planning agent that decomposes queries and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D object identification and retrieves relevant frames from extensive scene observations, and a coding agent that carries out flexible geometric reasoning and explicit verification through executable programs. Together, these agents enable effective grounded reasoning across diverse 3D scenes.
What carries the argument
The multi-agent collaboration architecture with a planning agent, grounding agent, and coding agent that dynamically coordinates off-the-shelf VLMs for task decomposition, free-form scene grounding, and programmatic geometric verification.
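The coordination pattern described here can be made concrete with a short sketch. The loop below is an illustration only, assuming a generic `vlm(prompt, images)` chat call, a sandboxed `run(code)` executor, and a `retrieve_relevant_frames` helper (sketched later in this review); none of these names come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    frames: list          # RGB observations of the 3D scene
    point_cloud: object   # aggregated geometry (e.g., fused depth)

def plan(query: str) -> list[str]:
    """Planning agent: decompose the query into grounded sub-tasks."""
    prompt = f"Decompose into grounding and geometry sub-tasks: {query}"
    return vlm(prompt, images=[]).splitlines()

def ground(subtask: str, scene: Scene) -> dict:
    """Grounding agent: retrieve relevant frames and localize objects."""
    frames = retrieve_relevant_frames(subtask, scene.frames)
    boxes = vlm(f"Return 3D boxes for: {subtask}", images=frames)
    return {"frames": frames, "boxes": boxes}

def verify(subtask: str, grounding: dict) -> str:
    """Coding agent: write and execute a geometric check as a program."""
    code = vlm(f"Write Python that verifies '{subtask}' given boxes "
               f"{grounding['boxes']}", images=[])
    return run(code)  # sandboxed execution returns a checkable verdict

def answer(query: str, scene: Scene) -> str:
    """Full pipeline: plan, ground each sub-task, verify, then synthesize."""
    evidence = [verify(t, ground(t, scene)) for t in plan(query)]
    return vlm(f"Answer '{query}' given verified evidence: {evidence}", images=[])
```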
If this is right
- The system applies to novel 3D scenes in a zero-shot manner without retraining.
- Open-ended queries about spatial and geometric relationships become feasible without fixed procedures.
- Explicit code-based verification reduces errors in geometric checks compared to text-only reasoning (a sketch follows this list).
- Performance reaches state-of-the-art levels on challenging 3D reasoning benchmarks.
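To make the verification bullet above concrete, here is the kind of small, explicit program the coding agent could emit for a query such as "is the chair to the left of the table?". The box format and the camera-frame convention (x increasing rightward) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def box_center(box: np.ndarray) -> np.ndarray:
    """box: (2, 3) array holding the min and max corners of an axis-aligned 3D box."""
    return box.mean(axis=0)

def is_left_of(box_a: np.ndarray, box_b: np.ndarray, margin: float = 0.05) -> bool:
    """True if A's center lies left of B's center by more than `margin` meters."""
    return box_center(box_a)[0] < box_center(box_b)[0] - margin

chair = np.array([[0.2, 0.0, 1.0], [0.7, 0.9, 1.5]])
table = np.array([[1.4, 0.0, 1.1], [2.3, 0.8, 2.0]])
print(is_left_of(chair, table))  # True: an explicit verdict a human can re-check
```

Unlike a free-text answer, the program's verdict can be re-executed and audited, which is the source of the claimed error reduction.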
Where Pith is reading between the lines
- Similar agent divisions could apply to related domains such as video or embodied scene reasoning.
- Lower dependence on 3D-specific training data may ease deployment in resource-limited settings.
- The modular design could support integration with robotic perception pipelines for real-time spatial tasks.
Load-bearing premise
Off-the-shelf vision-language models can reliably carry out accurate 3D grounding and geometric verification when guided by the planning, grounding, and coding agents, without any task-specific training.
What would settle it
A 3D scene and natural-language query on which the multi-agent system returns incorrect object grounding or an erroneous spatial verification that a human observer can directly confirm as wrong.
Original abstract
Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MAG-3D, a training-free multi-agent framework for grounded 3D reasoning that coordinates three off-the-shelf VLMs: a planning agent that decomposes queries and orchestrates the process, a grounding agent that performs free-form object and region identification plus frame retrieval from 3D scene observations, and a coding agent that translates geometric queries into executable programs for explicit spatial verification. The central claim is that this collaborative design enables flexible zero-shot 3D reasoning across diverse scenes and attains state-of-the-art results on challenging benchmarks without task-specific training or hand-crafted pipelines.
Significance. If the experimental results hold, the work would be significant for demonstrating that dynamic multi-agent coordination of existing VLMs can address core 3D grounding and verification challenges more flexibly than prior tuned or pipeline-based methods. The training-free nature and explicit code-based verification step represent a potentially generalizable direction for open-ended 3D understanding.
major comments (3)
- [§4.2] (Coding Agent): The central claim that the coding agent produces reliable executable programs for geometric verification is load-bearing, yet the manuscript provides no quantitative metrics on code-generation success rate, execution error frequency, or failure modes (e.g., incorrect coordinate transforms or distance computations). Without these, it is impossible to verify whether the “explicit verification” step actually contributes to the reported SOTA gains or merely passes through VLM hallucinations.
- [Table 3, §5.1] The SOTA performance claims are presented without ablation studies that isolate the contribution of the coding agent (e.g., replacing it with direct VLM reasoning) or the multi-agent orchestration. The performance delta over single-VLM baselines is therefore not attributable to the proposed design, weakening the causal link between the framework and the benchmark results.
- [§3.3] (Grounding Agent): The description of free-form 3D grounding and relevant-frame retrieval lacks detail on how the agent handles large-scale point clouds or meshes; no pseudocode, prompt templates, or failure-case analysis is supplied, making reproducibility and robustness assessment difficult despite the training-free claim.
minor comments (3)
- [Abstract, §1] The abstract and §1 repeatedly use “state-of-the-art” without immediately citing the specific benchmarks and prior numbers; a parenthetical reference to the main results table would improve clarity.
- [Figure 2] Figure 2 (agent interaction diagram) uses overlapping arrows that obscure the exact data flow between the grounding and coding agents; redrawing with clearer sequencing would aid comprehension.
- [§3.1, §4.1] Notation for 3D coordinates and bounding boxes is introduced inconsistently between §3.1 and §4.1; a single unified definition table would eliminate ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical support and reproducibility of MAG-3D. We address each major comment below and will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
- Referee: [§4.2] (Coding Agent): The central claim that the coding agent produces reliable executable programs for geometric verification is load-bearing, yet the manuscript provides no quantitative metrics on code-generation success rate, execution error frequency, or failure modes (e.g., incorrect coordinate transforms or distance computations). Without these, it is impossible to verify whether the “explicit verification” step actually contributes to the reported SOTA gains or merely passes through VLM hallucinations.
Authors: We agree that direct quantitative evaluation of the coding agent would provide stronger evidence for its contribution. Although the end-to-end SOTA results on multiple benchmarks offer indirect support for the verification step's utility, we will add a dedicated analysis in the revision. This will include code-generation success rates, execution error frequencies, and categorized failure modes (such as coordinate transform errors) computed over the evaluation datasets, presented in a new table or subsection under §4.2. revision: yes
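One plausible harness for the promised metrics is sketched below: syntax-check each generated program, run it in a sandbox, and tally the outcomes. `execute_sandboxed` and its `returncode` field are placeholders; the paper does not describe an evaluation harness, so this is an assumption about how such numbers could be computed.

```python
from collections import Counter

def tally(generated_programs: list[str]) -> Counter:
    """Categorize coding-agent outputs into success and failure modes."""
    outcomes = Counter()
    for code in generated_programs:
        try:
            compile(code, "<coding-agent>", "exec")  # detect syntax errors first
        except SyntaxError:
            outcomes["syntax_error"] += 1
            continue
        try:
            result = execute_sandboxed(code, timeout_s=10)  # hypothetical sandbox
            outcomes["ok" if result.returncode == 0 else "runtime_error"] += 1
        except TimeoutError:
            outcomes["timeout"] += 1
    return outcomes  # success rate = outcomes["ok"] / sum(outcomes.values())
```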
- Referee: [Table 3, §5.1] The SOTA performance claims are presented without ablation studies that isolate the contribution of the coding agent (e.g., replacing it with direct VLM reasoning) or the multi-agent orchestration. The performance delta over single-VLM baselines is therefore not attributable to the proposed design, weakening the causal link between the framework and the benchmark results.
Authors: We acknowledge that the current comparisons to single-VLM baselines in Table 3 do not fully isolate the individual components. To address this, we will perform and report additional ablation experiments in the revised §5.1 and Table 3 (or an extended table). These will include variants that replace the coding agent with direct VLM-based geometric reasoning and variants that use a single unified agent instead of the multi-agent orchestration, allowing clearer attribution of performance gains to the proposed design. revision: yes
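The ablation grid this response commits to could look roughly like the sketch below. The variant names and the `evaluate` harness are illustrative placeholders, not the authors' actual experimental code.

```python
VARIANTS = {
    "full":            dict(planning=True,  grounding=True,  coding=True),
    "no_coding_agent": dict(planning=True,  grounding=True,  coding=False),  # direct VLM geometry
    "single_agent":    dict(planning=False, grounding=False, coding=False),  # one unified VLM
}

for name, cfg in VARIANTS.items():
    score = evaluate(benchmark="3d-reasoning", **cfg)  # hypothetical harness
    print(f"{name}: {score:.1f}")
```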
- Referee: [§3.3] (Grounding Agent): The description of free-form 3D grounding and relevant-frame retrieval lacks detail on how the agent handles large-scale point clouds or meshes; no pseudocode, prompt templates, or failure-case analysis is supplied, making reproducibility and robustness assessment difficult despite the training-free claim.
Authors: We agree that expanded implementation details are needed for reproducibility. In the revised manuscript, we will augment §3.3 with pseudocode outlining the grounding and frame-retrieval procedure, example prompt templates provided to the off-the-shelf VLM, and a short discussion of observed failure cases when processing large-scale point clouds or meshes. These additions will clarify the training-free mechanism without altering the core claims. revision: yes
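Pending the promised pseudocode, one simple form the frame-retrieval step could take is sketched here: score every observation's relevance with the VLM, then keep the top-k in scene order. The scoring prompt and the `vlm_score` call are assumptions, not the paper's procedure; this also fills in the `retrieve_relevant_frames` placeholder used in the earlier orchestration sketch.

```python
import heapq

def retrieve_relevant_frames(query: str, frames: list, k: int = 8) -> list:
    """Rank scene observations by VLM-judged relevance to the grounding query."""
    scored = []
    for i, frame in enumerate(frames):
        s = vlm_score(f"Rate 0-10: does this frame show '{query}'?", frame)
        scored.append((s, i, frame))  # unique index breaks score ties
    top = heapq.nlargest(k, scored)          # k highest-scoring frames
    return [f for _, _, f in sorted(top, key=lambda t: t[1])]  # restore scene order
```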
Circularity Check
No circularity; empirical framework using external VLMs with no self-referential derivations
Full rationale
The paper describes a training-free multi-agent system (planning, grounding, and coding agents) built on off-the-shelf VLMs. No equations, fitted parameters, or predictions appear in the provided text. Claims of SOTA performance rest on benchmark evaluation rather than any reduction of outputs to self-defined inputs or self-citation chains. The central design is presented as a novel coordination method whose validity is externally testable, satisfying the criteria for a self-contained, non-circular contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: off-the-shelf VLMs can perform free-form 3D grounding and executable geometric reasoning when coordinated by specialized agents.
invented entities (3)
- Planning agent: no independent evidence
- Grounding agent: no independent evidence
- Coding agent: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "MAG-3D dynamically coordinates expert agents... planning agent that decomposes the task... grounding agent that performs free-form 3D grounding... coding agent that conducts flexible geometric reasoning and explicit verification through executable programs."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "training-free multi-agent framework... off-the-shelf VLMs... no task-specific training or hand-crafted pipelines"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.