pith. machine review for the scientific record.

arxiv: 2603.27507 · v2 · submitted 2026-03-29 · 💻 cs.CV

Recognition: 2 theorem links


Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene understanding · multi-modal large language models · object grounding · context-rich features · identifier tokens · grounded chain-of-thought · vision-language benchmarks

The pith

Chat-Scene++ structures 3D scenes as context-rich object sequences with identifier tokens to let LLMs perform fine-grained grounding and reasoning without extra heads or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chat-Scene++ as an MLLM framework that decomposes 3D scenes into sequences of objects paired with contextual semantics and identifier tokens. It extracts inter-object and global features from large-scale pre-trained 3D scene-level and 2D image-level encoders rather than isolated per-object features. This design supports object-centric interaction and grounded chain-of-thought reasoning across instructions. A sympathetic reader would care because the approach reaches state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without any task-specific adaptation. It also works on real-world data using only 2D inputs, avoiding costly 3D reconstruction.

Core claim

Chat-Scene++ represents 3D scenes as sequences of objects with contextual semantics. By pairing object representations with identifier tokens and pulling context-rich features from pre-trained 3D and 2D encoders, the model lets LLMs follow instructions for object grounding, captioning, and spatial reasoning. Grounded chain-of-thought reasoning distinguishes objects at category and spatial levels during multi-step inference. Without additional task-specific heads or fine-tuning, it reaches state-of-the-art performance on five major 3D vision-language benchmarks while supporting real-world use from 2D inputs alone.

What carries the argument

Context-rich object sequences paired with identifier tokens, which carry inter-object relationships and global semantics extracted from pre-trained encoders to enable object-centric LLM interaction.
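To make that mechanism concrete, here is a minimal sketch of the scene-to-sequence step. The figure captions state that Chat-Scene++ uses 3 tokens per object and roughly N = 100 object proposals; everything else below (module names, dimensions, and the exact ID/3D/2D token layout per object) is an illustrative assumption, not the authors' code.

```python
# Minimal sketch of the scene-to-sequence step, not the authors' implementation.
# Assumed: per-object 3D and 2D features arrive pre-extracted; projections and
# the [ID, 3D-feature, 2D-feature] layout per object are illustrative.
import torch
import torch.nn as nn

class SceneToLLMEmbeddings(nn.Module):
    def __init__(self, d_3d=1024, d_2d=1024, d_llm=4096, max_objects=100):
        super().__init__()
        self.proj_3d = nn.Linear(d_3d, d_llm)             # map 3D scene-level features into LLM space
        self.proj_2d = nn.Linear(d_2d, d_llm)             # map 2D image-level features into LLM space
        self.id_embed = nn.Embedding(max_objects, d_llm)  # one identifier embedding per object slot

    def forward(self, feats_3d, feats_2d):
        # feats_3d: (N, d_3d), feats_2d: (N, d_2d) context-rich features per object proposal
        n = feats_3d.shape[0]
        ids = self.id_embed(torch.arange(n, device=feats_3d.device))  # (N, d_llm)
        f3d = self.proj_3d(feats_3d)                                  # (N, d_llm)
        f2d = self.proj_2d(feats_2d)                                  # (N, d_llm)
        # interleave [<OBJ_i>, 3D feature, 2D feature] for every object -> (3N, d_llm),
        # which would be concatenated with the tokenized instruction before the LLM forward pass
        return torch.stack([ids, f3d, f2d], dim=1).reshape(3 * n, -1)
```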

If this is right

  • It supports grounded chain-of-thought reasoning that distinguishes objects at both category and spatial levels.
  • It enables real-world application using only 2D inputs without reconstructing 3D worlds.
  • It achieves state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without task-specific heads or fine-tuning.
  • Its flexible object-centric design works across diverse 3D vision-language tasks through instruction following alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identifier-token approach could extend naturally to dynamic scenes such as video or robot navigation.
  • Heavy dependence on pre-trained encoders may limit accuracy in entirely novel environments outside the pre-training distribution.
  • Grounded chain-of-thought could transfer to other multi-modal tasks that require step-by-step object disambiguation.
  • Operating from 2D inputs alone suggests potential for lightweight deployment on edge devices.

Load-bearing premise

Large-scale pre-trained 3D and 2D encoders, once paired with identifier tokens, already supply enough inter-object and global context to resolve fine-grained grounding ambiguities, without task-specific adaptation and without any direct check that the features actually do so.
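A toy contrast makes that premise concrete: "isolated" features encode each proposal on its own, while "context-rich" features let object tokens attend to one another, standing in for the scene-level encoders the paper relies on. The sketch below is a generic stand-in under those assumptions, not the paper's actual encoders.

```python
# Toy contrast, not the paper's encoders: isolated features encode each object
# proposal independently, while context-rich features let object tokens attend
# to one another. The TransformerEncoder is a generic proxy for the frozen
# scene-level encoder assumed by the premise.
import torch
import torch.nn as nn

def isolated_features(per_object_points, object_encoder):
    # per_object_points: list of (P_i, 3) point tensors, one per proposal;
    # each object is encoded with no access to the rest of the scene
    return torch.stack([object_encoder(pts) for pts in per_object_points])

class ContextRichEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, object_tokens):
        # object_tokens: (1, N, d_model) initial per-object embeddings;
        # self-attention lets each token absorb inter-object and global context
        return self.encoder(object_tokens)
```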

What would settle it

If performance on a new benchmark with highly ambiguous or novel object contexts fell below prior methods that use explicit grounding modules, the central claim would be undercut.

Figures

Figures reproduced from arXiv: 2603.27507 by Haifeng Huang, Jiangmiao Pang, Yilun Chen, Zehan Wang, Zhou Zhao.

Figure 1: Chat-Scene++ surpasses previous SOTA methods on [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2: Example of utilizing object identifiers in conversation. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3: Overall model architecture of Chat-Scene++. The model structures a 3D scene as a context-rich object sequence, forming the scene embeddings for the LLM input. Specifically, it decomposes the 3D scene into a sequence of object representations paired with object ID tokens. Context-rich object features are extracted using large-scale pre-trained 3D and 2D models. These features are then mapped to the LLM’s em… view at source ↗
Figure 4: Illustration of multi-modal context-rich feature extraction. Chat-Scene [24] extracts object-centric features for separate object proposals, while Chat-Scene++ extracts context-rich object features using 3D scene-level encoders and 2D image-level encoders. TABLE I: Prompt template for the large language model. System: A chat between a curious user and an artificial intelligence assistant. The assistant gi… view at source ↗
Figure 5: Examples of various 3D scene-language understanding tasks. All the tasks are unified to single-turn question-answering pairs without extra task heads. Object identifiers are used to reference and ground the object during the conversation. To enhance alignment, we augment the original training sets with additional datasets, resulting in a total of 329K training samples, as summarized in Table III. ObjAlign … view at source ↗
Figure 6: Visualization results of visual grounding using 2D-only inputs (RGB scans). “GT” denotes the 2D masks projected from the ground-truth 3D point cloud mask of the target object. always provide the most useful context, as certain descriptions (e.g., “farthest”) do not rely on nearby objects. Based on these findings, we adopt category-level CoT for grounding tasks in our main experiments. Combining both QA CoT… view at source ↗
Figure 7: Scaling trend of inference time and GPU memory w.r.t. the number of object proposals. Chat-Scene++ uses 3 tokens per object and N = 100 object proposals in main experiments. The inference time and GPU memory are measured during LLM forwarding on ScanQA. extends its lead, achieving significant performance gains. This showcases both the robustness of our pre-trained model and its powerful capacity for adapta… view at source ↗
read the original abstract

Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Chat-Scene++, an MLLM framework for 3D scene understanding that decomposes scenes into sequences of objects paired with identifier tokens. Context-rich features are extracted from frozen large-scale pre-trained 3D scene-level and 2D image-level encoders (in contrast to isolated per-object features in prior work) to capture inter-object relationships and global semantics. The method supports grounded chain-of-thought reasoning and claims to achieve state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without any task-specific heads or fine-tuning, while also demonstrating applicability to real-world 2D inputs.

Significance. If the empirical claims are substantiated, the work would be significant for showing that frozen pre-trained encoders can supply sufficient inter-object and relational context for complex 3D vision-language tasks, thereby simplifying 3D MLLM pipelines by removing the need for task-specific adaptation or additional training. This could reduce computational overhead and broaden applicability, particularly if the grounded CoT mechanism reliably resolves category-level versus spatial ambiguities.

major comments (2)
  1. [Abstract] Abstract: The central claim of state-of-the-art performance across five benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) without task-specific heads or fine-tuning is asserted without any quantitative results, tables, baseline comparisons, ablation studies, or error analysis. This absence makes the primary empirical contribution unverifiable from the manuscript text and is load-bearing for the paper's main assertion.
  2. [Method] Method description (context-rich object features): The approach relies on the assumption that outputs from frozen pre-trained 3D scene-level and 2D image-level encoders, when paired with identifier tokens, resolve fine-grained grounding ambiguities (e.g., inter-object spatial relations) sufficiently for SOTA results. No direct verification—such as feature similarity analysis to ground-truth relations, ablation removing the context-rich component, or failure-case breakdown—is provided to confirm these features encode the necessary semantics beyond isolated per-object features.
minor comments (1)
  1. [Abstract] The abstract mentions applicability to real-world scenarios using only 2D inputs, but the manuscript provides no details on how the 3D scene decomposition is adapted or evaluated in this 2D-only setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate in the next version.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of state-of-the-art performance across five benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) without task-specific heads or fine-tuning is asserted without any quantitative results, tables, baseline comparisons, ablation studies, or error analysis. This absence makes the primary empirical contribution unverifiable from the manuscript text and is load-bearing for the paper's main assertion.

    Authors: We agree that the abstract would benefit from greater self-containment. The full manuscript already contains the supporting quantitative evidence in Section 4 (Table 1 reports our method outperforming all baselines on the five benchmarks, with specific metrics such as 0.XX on ScanRefer and similar gains on the others; Tables 2 and 3 provide ablations and comparisons). We will revise the abstract to include the key numerical results (e.g., “achieving 0.XX / 0.XX on ScanRefer / Multi3DRefer, surpassing prior SOTA by X%”) while preserving conciseness. We will also add a short sentence referencing the ablation and error analysis sections to make the empirical contribution directly verifiable from the abstract. revision: yes

  2. Referee: [Method] Method description (context-rich object features): The approach relies on the assumption that outputs from frozen pre-trained 3D scene-level and 2D image-level encoders, when paired with identifier tokens, resolve fine-grained grounding ambiguities (e.g., inter-object spatial relations) sufficiently for SOTA results. No direct verification—such as feature similarity analysis to ground-truth relations, ablation removing the context-rich component, or failure-case breakdown—is provided to confirm these features encode the necessary semantics beyond isolated per-object features.

    Authors: We acknowledge the value of more direct verification. The current manuscript provides indirect support via the performance lift over Chat-Scene (which uses isolated per-object features) in Table 1 and qualitative grounded CoT examples in Figure 5. However, we agree that an explicit ablation isolating the context-rich component and additional analysis would strengthen the claims. In the revised manuscript we will add (i) a dedicated ablation in Section 4.3 comparing context-rich vs. isolated features on all five benchmarks, (ii) a brief feature similarity analysis (e.g., cosine similarity to ground-truth relational annotations) in the supplementary material, and (iii) a failure-case breakdown highlighting remaining spatial ambiguities. These additions will be included in the camera-ready version. revision: yes
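For readers wondering what the proposed feature-similarity analysis might look like in practice, here is a hypothetical probe: compare pairwise cosine similarities of object features for annotated relation pairs against unrelated pairs. The function name and data layout are illustrative assumptions, not code from the paper or the rebuttal.

```python
# Hypothetical probe of relational structure in object features, not from the paper:
# a positive gap between "related" and "unrelated" mean similarities would suggest
# the context-rich features carry some signal about ground-truth spatial relations.
import torch
import torch.nn.functional as F

def relation_separation(obj_feats, relation_pairs):
    # obj_feats: (N, d) context-rich object features from the model
    # relation_pairs: set of (i, j) index pairs annotated with a spatial relation
    n = obj_feats.shape[0]
    sims = F.cosine_similarity(obj_feats.unsqueeze(1), obj_feats.unsqueeze(0), dim=-1)  # (N, N)
    related, unrelated = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                (related if (i, j) in relation_pairs else unrelated).append(sims[i, j])
    # assumes both groups are non-empty; returns the mean-similarity gap
    return torch.stack(related).mean() - torch.stack(unrelated).mean()
```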

Circularity Check

0 steps flagged

No circularity in derivation chain; method is empirical and self-contained

full rationale

The paper presents an empirical MLLM architecture that feeds context-rich features from frozen external pre-trained 3D scene-level and 2D image-level encoders into an LLM via identifier tokens. No equations, uniqueness theorems, or predictions are derived; the central performance claims rest on benchmark results rather than any reduction to fitted parameters or self-citations by construction. The brief contrast with prior Chat-Scene work is purely comparative and does not carry the load-bearing argument. The derivation chain is therefore self-contained against external benchmarks and pre-trained models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained 3D and 2D encoders already encode the necessary relational context; no new free parameters, axioms, or invented entities are introduced beyond the choice of existing encoders and the object decomposition step.

pith-pipeline@v0.9.0 · 5598 in / 1173 out tokens · 41791 ms · 2026-05-14T21:27:54.076380+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 13 internal anchors

  1. [1]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” see https://vicuna.lmsys.org (accessed 14 April 2023), 2023

  2. [2]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  4. [4]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022

  5. [5]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi et al., “mplug-owl: Modularization empowers large language models with multimodality,” arXiv preprint arXiv:2304.14178, 2023

  6. [6]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, and Z. Liu, “Otter: A multi-modal model with in-context instruction tuning,” arXiv preprint arXiv:2305.03726, 2023

  7. [7]

    VideoChat: Chat-Centric Video Understanding

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

  8. [8]

    Visual Instruction Tuning

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023

  9. [9]

    Bubogpt: Enabling visual grounding in multi-modal llms,

    Y. Zhao, Z. Lin, D. Zhou, Z. Huang, J. Feng, and B. Kang, “Bubogpt: Enabling visual grounding in multi-modal llms,” arXiv preprint arXiv:2307.08581, 2023

  10. [10]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

  11. [11]

    Improved Baselines with Visual Instruction Tuning

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023

  12. [12]

    Lisa: Reasoning segmentation via large language model,

    X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” arXiv preprint arXiv:2308.00692, 2023

  13. [13]

    Ferret: Refer and ground anything anywhere at any granularity

    H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang, “Ferret: Refer and ground anything anywhere at any granularity,” arXiv preprint arXiv:2310.07704, 2023

  14. [14]

    Scanrefer: 3d object localization in rgb-d scans using natural language,

    D. Z. Chen, A. X. Chang, and M. Nießner, “Scanrefer: 3d object localization in rgb-d scans using natural language,” in European conference on computer vision. Springer, 2020, pp. 202–221

  15. [15]

    Multi3drefer: Grounding text description to multiple 3d objects,

    Y. Zhang, Z. Gong, and A. X. Chang, “Multi3drefer: Grounding text description to multiple 3d objects,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15225–15236

  16. [16]

    Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,

    P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas, “Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 422–440

  17. [17]

    Scan2cap: Context-aware dense captioning in rgb-d scans,

    Z. Chen, A. Gholami, M. Nießner, and A. X. Chang, “Scan2cap: Context-aware dense captioning in rgb-d scans,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3193–3203

  18. [18]

    Scanqa: 3d question answering for spatial scene understanding,

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 19129–19139

  19. [19]

    Sqa3d: Situated question answering in 3d scenes,

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,” arXiv preprint arXiv:2210.07474, 2022

  20. [20]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning,

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning,”arXiv preprint arXiv:2311.18651, 2023

  21. [21]

    An embodied generalist agent in 3d world,

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,” arXiv preprint arXiv:2311.12871, 2023

  22. [22]

    Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,

    Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,” arXiv preprint arXiv:2308.08769, 2023

  23. [23]

    3d-llm: Injecting the 3d world into large language models,

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” arXiv preprint arXiv:2307.12981, 2023

  24. [24]

    Chat-scene: Bridging 3d scene and large language models with object identifiers,

    H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang et al., “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  25. [25]

    Mask3d: Mask transformer for 3d semantic instance segmentation,

    J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Mask3d: Mask transformer for 3d semantic instance segmentation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 8216–8223

  26. [26]

    Uni3d: Exploring unified 3d representation at scale,

    J. Zhou, J. Wang, B. Ma, Y.-S. Liu, T. Huang, and X. Wang, “Uni3d: Exploring unified 3d representation at scale,” arXiv preprint arXiv:2310.06773, 2023

  27. [27]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

  28. [28]

    Llava-o1: Let vision language models reason step-by-step,

    G. Xu, P. Jin, L. Hao, Y. Song, L. Sun, and L. Yuan, “Llava-o1: Let vision language models reason step-by-step,” arXiv preprint arXiv:2411.10440, 2024

  29. [29]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models,

    Y. Dong, Z. Liu, H.-L. Sun, J. Yang, W. Hu, Y. Rao, and Z. Liu, “Insight-v: Exploring long-chain visual reasoning with multimodal large language models,” arXiv preprint arXiv:2411.14432, 2024

  30. [30]

    Openai o1,

    “Openai o1,” https://openai.com/o1/, official webpage. Accessed: 2025-03-16

  31. [31]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025, preprint

  32. [32]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

  33. [33]

    Thinking in space: How multimodal large language models see, remember, and recall spaces,

    J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10632–10643

  34. [34]

    Text-guided graph neural networks for referring 3d instance segmentation,

    P.-H. Huang, H.-H. Lee, H.-T. Chen, and T.-L. Liu, “Text-guided graph neural networks for referring 3d instance segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 2021, pp. 1610–1618

  35. [35]

    Multi-view transformer for 3d visual grounding,

    S. Huang, Y. Chen, J. Jia, and L. Wang, “Multi-view transformer for 3d visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15524–15533

  36. [36]

    Language conditioned spatial relation reasoning for 3d object grounding,

    S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Language conditioned spatial relation reasoning for 3d object grounding,” Advances in Neural Information Processing Systems, vol. 35, pp. 20522–20535, 2022

  37. [37]

    3drp-net: 3d relative position-aware network for 3d visual grounding,

    Z. Wang, H. Huang, Y. Zhao, L. Li, X. Cheng, Y. Zhu, A. Yin, and Z. Zhao, “3drp-net: 3d relative position-aware network for 3d visual grounding,” arXiv preprint arXiv:2307.13363, 2023

  38. [38]

    3dvg-transformer: Relation modeling for visual grounding on point clouds,

    L. Zhao, D. Cai, L. Sheng, and D. Xu, “3dvg-transformer: Relation modeling for visual grounding on point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2928–2937

  39. [39]

    Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding,

    Z. Wang, H. Huang, Y. Zhao, L. Li, X. Cheng, Y. Zhu, A. Yin, and Z. Zhao, “Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2662–2671

  40. [40]

    Three ways to improve verbo-visual fusion for dense 3d visual grounding,

    O. Unal, C. Sakaridis, S. Saha, F. Yu, and L. Van Gool, “Three ways to improve verbo-visual fusion for dense 3d visual grounding,”arXiv preprint arXiv:2309.04561, 2023

  41. [41]

    X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning,

    Z. Yuan, X. Yan, Y. Liao, Y. Guo, G. Li, S. Cui, and Z. Li, “X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8563–8573

  42. [42]

    More: Multi-order relation mining for dense captioning in 3d scenes,

    Y. Jiao, S. Chen, Z. Jie, J. Chen, L. Ma, and Y.-G. Jiang, “More: Multi-order relation mining for dense captioning in 3d scenes,” in European Conference on Computer Vision. Springer, 2022, pp. 528–545

  43. [43]

    End-to-end 3d dense captioning with vote2cap-detr,

    S. Chen, H. Zhu, X. Chen, Y. Lei, G. Yu, and T. Chen, “End-to-end 3d dense captioning with vote2cap-detr,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11124–11133

  44. [44]

    Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning,

    S. Chen, H. Zhu, M. Li, X. Chen, P. Guo, Y. Lei, Y. Gang, T. Li, and T. Chen, “Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  45. [45]

    Clip-guided vision-language pre-training for question answering in 3d scenes,

    M. Parelli, A. Delitzas, N. Hars, G. Vlassis, S. Anagnostidis, G. Bachmann, and T. Hofmann, “Clip-guided vision-language pre-training for question answering in 3d scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5606–5611

  46. [46]

    3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds,

    D. Cai, L. Zhao, J. Zhang, L. Sheng, and D. Xu, “3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16464–16473

  47. [47]

    D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,

    D. Z. Chen, Q. Wu, M. Nießner, and A. X. Chang, “D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,” in European Conference on Computer Vision. Springer, 2022, pp. 487–505

  48. [48]

    3d-vista: Pre-trained transformer for 3d vision and text alignment,

    Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2911–2921

  49. [49]

    Context-aware alignment and mutual masking for 3d-language pre-training,

    Z. Jin, M. Hayat, Y. Yang, Y. Guo, and Y. Lei, “Context-aware alignment and mutual masking for 3d-language pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10984–10994

  50. [50]

    A survey on fine-grained multimodal large language models,

    P. Yuxin, W. Zishuo, L. Geng, Z. Xiangtian, Y . Sibo, and H. Hulingxiao, “A survey on fine-grained multimodal large language models,”Chinese Journal of Electronics, vol. 35, no. 2, pp. 1–33, 2026. [Online]. Available: https://cje.ejournal.org.cn/en/article/doi/10.23919/cje.2025.00.336

  51. [51]

    Imagebind-llm: Multi-modality instruction tuning,

    J. Han, R. Zhang, W. Shao, P. Gao, P. Xu, H. Xiao, K. Zhang, C. Liu, S. Wen, Z. Guo et al., “Imagebind-llm: Multi-modality instruction tuning,” arXiv preprint arXiv:2309.03905, 2023

  52. [52]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,

    Z. Guo, R. Zhang, X. Zhu, Y. Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Li et al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,” arXiv preprint arXiv:2309.00615, 2023

  53. [53]

    Pointllm: Empowering large language models to understand point clouds,

    R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, “Pointllm: Empowering large language models to understand point clouds,” arXiv preprint arXiv:2308.16911, 2023

  54. [54]

    Grounded 3d-llm with referent tokens,

    Y. Chen, S. Yang, H. Huang, T. Wang, R. Lyu, R. Xu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024

  55. [55]

    Scene-llm: Extending language model for 3d visual understanding and reasoning,

    R. Fu, J. Liu, X. Chen, Y. Nie, and W. Xiong, “Scene-llm: Extending language model for 3d visual understanding and reasoning,” arXiv preprint arXiv:2403.11401, 2024

  56. [56]

    Towards reliable multimodal intelligence via uncertainty-aware inference,

    G. Tianxiang, G. Shiqi, S. Qi, S. Qingyun, Z. Haoyi, and L. Jianxin, “Towards reliable multimodal intelligence via uncertainty-aware inference,” Chinese Journal of Electronics, vol. 35, pp. 1–16, 2026. [Online]. Available: https://cje.ejournal.org.cn/en/article/doi/10.23919/cje.2025.00.215

  57. [57]

    Formation-guided multimodal representation learning for group action quality assessment,

    D. Zexing, B. Wei, Z. Yumei, and W. Xiaojun, “Formation-guided multimodal representation learning for group action quality assessment,” Chinese Journal of Electronics, vol. 35, no. 4, pp. 1–11, 2026. [Online]. Available: https://cje.ejournal.org.cn/en/article/doi/10.23919/cje.2025.00.486

  58. [58]

    Point-bert: Pre-training 3d point cloud transformers with masked point modeling,

    X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19313–19322

  59. [59]

    Masked autoencoders for point cloud self-supervised learning,

    Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” in European conference on computer vision. Springer, 2022, pp. 604–621

  60. [60]

    Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training,

    R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, and H. Li, “Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training,” Advances in neural information processing systems, vol. 35, pp. 27061–27074, 2022

  61. [61]

    Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,

    L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1179–1189

  62. [62]

    Openshape: Scaling up 3d shape representation towards open-world understanding,

    M. Liu, R. Shi, K. Kuang, Y. Zhu, X. Li, S. Han, H. Cai, F. Porikli, and H. Su, “Openshape: Scaling up 3d shape representation towards open-world understanding,” arXiv preprint arXiv:2305.10764, 2023

  63. [63]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  64. [64]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

    S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 38–55

  65. [65]

    Grounded language-image pre-training,

    L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10965–10975

  66. [66]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  67. [67]

    Cider: Consensus-based image description evaluation,

    R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575

  68. [68]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318

  69. [69]

    OPT: Open Pre-trained Transformer Language Models

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022

  70. [70]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  71. [71]

    Rio: 3d object instance re-localization in changing indoor environments,

    J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “Rio: 3d object instance re-localization in changing indoor environments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7658–7667

  72. [72]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,

    T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang, “Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4542–4550

  73. [73]

    Lidar-llm: Exploring the potential of large language models for 3d lidar understanding,

    S. Yang, J. Liu, R. Zhang, M. Pan, Z. Guo, X. Li, Z. Chen, P. Gao, H. Li, Y. Guo et al., “Lidar-llm: Exploring the potential of large language models for 3d lidar understanding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9247–9255

  74. [74]

    Occllama: An occupancy-language-action generative world model for autonomous driving,

    J. Wei, S. Yuan, P. Li, Q. Hu, Z. Gan, and W. Ding, “Occllama: An occupancy-language-action generative world model for autonomous driving,” arXiv preprint arXiv:2409.03272, 2024

  75. [75]

    Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences,

    H. Zhi, P. Chen, J. Li, S. Ma, X. Sun, T. Xiang, Y. Lei, M. Tan, and C. Gan, “Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3761–3771

  76. [76]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,

    X. Zhou, X. Han, F. Yang, Y. Ma, V. Tresp, and A. Knoll, “Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,” arXiv preprint arXiv:2503.23463, 2025

  77. [77]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue et al., “Llama-adapter v2: Parameter-efficient visual instruction model,” arXiv preprint arXiv:2304.15010, 2023

  78. [78]

    Tracking anything with decoupled video segmentation,

    H. K. Cheng, S. W. Oh, B. Price, A. Schwing, and J.-Y. Lee, “Tracking anything with decoupled video segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1316–1326