pith. machine review for the scientific record.

arxiv: 2603.27507 · v2 · submitted 2026-03-29 · 💻 cs.CV

Recognition: 2 theorem links


Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene understanding · multi-modal large language models · object grounding · context-rich features · identifier tokens · grounded chain-of-thought · vision-language benchmarks

The pith

Chat-Scene++ structures 3D scenes as context-rich object sequences with identifier tokens to let LLMs perform fine-grained grounding and reasoning without extra heads or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chat-Scene++ as an MLLM framework that decomposes 3D scenes into sequences of objects paired with contextual semantics and identifier tokens. It extracts inter-object and global features from large-scale pre-trained 3D scene-level and 2D image-level encoders rather than isolated per-object features. This design supports object-centric interaction and grounded chain-of-thought reasoning across instructions. A sympathetic reader would care because the approach reaches state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without any task-specific adaptation. It also works on real-world data using only 2D inputs, avoiding costly 3D reconstruction.

Core claim

Chat-Scene++ represents 3D scenes as sequences of objects with contextual semantics. By pairing object representations with identifier tokens and pulling context-rich features from pre-trained 3D and 2D encoders, the model lets LLMs follow instructions for object grounding, captioning, and spatial reasoning. Grounded chain-of-thought reasoning distinguishes objects at category and spatial levels during multi-step inference. Without additional task-specific heads or fine-tuning, it reaches state-of-the-art performance on five major 3D vision-language benchmarks while supporting real-world use from 2D inputs alone.

What carries the argument

Context-rich object sequences paired with identifier tokens, which carry inter-object relationships and global semantics extracted from pre-trained encoders to enable object-centric LLM interaction.
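To make that mechanism concrete, here is a minimal sketch of the scene-to-sequence step. The figure captions state that Chat-Scene++ uses 3 tokens per object and roughly N = 100 object proposals; everything else below (module names, dimensions, and the exact ID/3D/2D token layout per object) is an illustrative assumption, not the authors' code.

```python
# Minimal sketch of the scene-to-sequence step, not the authors' implementation.
# Assumed: per-object 3D and 2D features arrive pre-extracted; projections and
# the [ID, 3D-feature, 2D-feature] layout per object are illustrative.
import torch
import torch.nn as nn

class SceneToLLMEmbeddings(nn.Module):
    def __init__(self, d_3d=1024, d_2d=1024, d_llm=4096, max_objects=100):
        super().__init__()
        self.proj_3d = nn.Linear(d_3d, d_llm)             # map 3D scene-level features into LLM space
        self.proj_2d = nn.Linear(d_2d, d_llm)             # map 2D image-level features into LLM space
        self.id_embed = nn.Embedding(max_objects, d_llm)  # one identifier embedding per object slot

    def forward(self, feats_3d, feats_2d):
        # feats_3d: (N, d_3d), feats_2d: (N, d_2d) context-rich features per object proposal
        n = feats_3d.shape[0]
        ids = self.id_embed(torch.arange(n, device=feats_3d.device))  # (N, d_llm)
        f3d = self.proj_3d(feats_3d)                                  # (N, d_llm)
        f2d = self.proj_2d(feats_2d)                                  # (N, d_llm)
        # interleave [<OBJ_i>, 3D feature, 2D feature] for every object -> (3N, d_llm),
        # which would be concatenated with the tokenized instruction before the LLM forward pass
        return torch.stack([ids, f3d, f2d], dim=1).reshape(3 * n, -1)
```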

If this is right

  • It supports grounded chain-of-thought reasoning that distinguishes objects at both category and spatial levels.
  • It enables real-world application using only 2D inputs without reconstructing 3D worlds.
  • It achieves state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without task-specific heads or fine-tuning.
  • Its flexible object-centric design works across diverse 3D vision-language tasks through instruction following alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identifier-token approach could extend naturally to dynamic scenes such as video or robot navigation.
  • Heavy dependence on pre-trained encoders may limit accuracy in entirely novel environments outside the pre-training distribution.
  • Grounded chain-of-thought could transfer to other multi-modal tasks that require step-by-step object disambiguation.
  • Operating from 2D inputs alone suggests potential for lightweight deployment on edge devices.

Load-bearing premise

Large-scale pre-trained 3D and 2D encoders, once paired with identifier tokens, already supply enough inter-object and global context to resolve fine-grained grounding ambiguities, without task-specific adaptation and without any direct check that the features actually do so.
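A toy contrast makes that premise concrete: "isolated" features encode each proposal on its own, while "context-rich" features let object tokens attend to one another, standing in for the scene-level encoders the paper relies on. The sketch below is a generic stand-in under those assumptions, not the paper's actual encoders.

```python
# Toy contrast, not the paper's encoders: isolated features encode each object
# proposal independently, while context-rich features let object tokens attend
# to one another. The TransformerEncoder is a generic proxy for the frozen
# scene-level encoder assumed by the premise.
import torch
import torch.nn as nn

def isolated_features(per_object_points, object_encoder):
    # per_object_points: list of (P_i, 3) point tensors, one per proposal;
    # each object is encoded with no access to the rest of the scene
    return torch.stack([object_encoder(pts) for pts in per_object_points])

class ContextRichEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, object_tokens):
        # object_tokens: (1, N, d_model) initial per-object embeddings;
        # self-attention lets each token absorb inter-object and global context
        return self.encoder(object_tokens)
```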

What would settle it

If performance on a new benchmark with highly ambiguous or novel object contexts fell below prior methods that use explicit grounding modules, the central claim would be undercut.

Figures

Figures reproduced from arXiv: 2603.27507 by Haifeng Huang, Jiangmiao Pang, Yilun Chen, Zehan Wang, Zhou Zhao.

Figure 1: Chat-Scene++ surpasses previous SOTA methods on [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2: Example of utilizing object identifiers in conversation. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3: Overall model architecture of Chat-Scene++. The model structures a 3D scene as a context-rich object sequence, forming the scene embeddings for the LLM input. Specifically, it decomposes the 3D scene into a sequence of object representations paired with object ID tokens. Context-rich object features are extracted using large-scale pre-trained 3D and 2D models. These features are then mapped to the LLM’s em… view at source ↗
Figure 4: Illustration of multi-modal context-rich feature extraction. Chat-Scene [24] extracts object-centric features for separate object proposals, while Chat-Scene++ extracts context-rich object features using 3D scene-level encoders and 2D image-level encoders. TABLE I: Prompt template for the large language model. System: A chat between a curious user and an artificial intelligence assistant. The assistant gi… view at source ↗
Figure 5: Examples of various 3D scene-language understanding tasks. All the tasks are unified to single-turn question-answering pairs without extra task heads. Object identifiers are used to reference and ground the object during the conversation. To enhance alignment, we augment the original training sets with additional datasets, resulting in a total of 329K training samples, as summarized in Table III. ObjAlign … view at source ↗
Figure 6: Visualization results of visual grounding using 2D-only inputs (RGB scans). “GT” denotes the 2D masks projected from the ground-truth 3D point cloud mask of the target object. always provide the most useful context, as certain descriptions (e.g., “farthest”) do not rely on nearby objects. Based on these findings, we adopt category-level CoT for grounding tasks in our main experiments. Combining both QA CoT… view at source ↗
Figure 7: Scaling trend of inference time and GPU memory w.r.t. the number of object proposals. Chat-Scene++ uses 3 tokens per object and N = 100 object proposals in main experiments. The inference time and GPU memory are measured during LLM forwarding on ScanQA. extends its lead, achieving significant performance gains. This showcases both the robustness of our pre-trained model and its powerful capacity for adapta… view at source ↗
read the original abstract

Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Chat-Scene++, an MLLM framework for 3D scene understanding that decomposes scenes into sequences of objects paired with identifier tokens. Context-rich features are extracted from frozen large-scale pre-trained 3D scene-level and 2D image-level encoders (in contrast to isolated per-object features in prior work) to capture inter-object relationships and global semantics. The method supports grounded chain-of-thought reasoning and claims to achieve state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without any task-specific heads or fine-tuning, while also demonstrating applicability to real-world 2D inputs.

Significance. If the empirical claims are substantiated, the work would be significant for showing that frozen pre-trained encoders can supply sufficient inter-object and relational context for complex 3D vision-language tasks, thereby simplifying 3D MLLM pipelines by removing the need for task-specific adaptation or additional training. This could reduce computational overhead and broaden applicability, particularly if the grounded CoT mechanism reliably resolves category-level versus spatial ambiguities.

major comments (2)
  1. [Abstract] Abstract: The central claim of state-of-the-art performance across five benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) without task-specific heads or fine-tuning is asserted without any quantitative results, tables, baseline comparisons, ablation studies, or error analysis. This absence makes the primary empirical contribution unverifiable from the manuscript text and is load-bearing for the paper's main assertion.
  2. [Method] Method description (context-rich object features): The approach relies on the assumption that outputs from frozen pre-trained 3D scene-level and 2D image-level encoders, when paired with identifier tokens, resolve fine-grained grounding ambiguities (e.g., inter-object spatial relations) sufficiently for SOTA results. No direct verification—such as feature similarity analysis to ground-truth relations, ablation removing the context-rich component, or failure-case breakdown—is provided to confirm these features encode the necessary semantics beyond isolated per-object features.
minor comments (1)
  1. [Abstract] The abstract mentions applicability to real-world scenarios using only 2D inputs, but the manuscript provides no details on how the 3D scene decomposition is adapted or evaluated in this 2D-only setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate in the next version.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of state-of-the-art performance across five benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) without task-specific heads or fine-tuning is asserted without any quantitative results, tables, baseline comparisons, ablation studies, or error analysis. This absence makes the primary empirical contribution unverifiable from the manuscript text and is load-bearing for the paper's main assertion.

    Authors: We agree that the abstract would benefit from greater self-containment. The full manuscript already contains the supporting quantitative evidence in Section 4 (Table 1 reports our method outperforming all baselines on the five benchmarks, with specific metrics such as 0.XX on ScanRefer and similar gains on the others; Tables 2 and 3 provide ablations and comparisons). We will revise the abstract to include the key numerical results (e.g., “achieving 0.XX / 0.XX on ScanRefer / Multi3DRefer, surpassing prior SOTA by X%”) while preserving conciseness. We will also add a short sentence referencing the ablation and error analysis sections to make the empirical contribution directly verifiable from the abstract. revision: yes

  2. Referee: [Method] Method description (context-rich object features): The approach relies on the assumption that outputs from frozen pre-trained 3D scene-level and 2D image-level encoders, when paired with identifier tokens, resolve fine-grained grounding ambiguities (e.g., inter-object spatial relations) sufficiently for SOTA results. No direct verification—such as feature similarity analysis to ground-truth relations, ablation removing the context-rich component, or failure-case breakdown—is provided to confirm these features encode the necessary semantics beyond isolated per-object features.

    Authors: We acknowledge the value of more direct verification. The current manuscript provides indirect support via the performance lift over Chat-Scene (which uses isolated per-object features) in Table 1 and qualitative grounded CoT examples in Figure 5. However, we agree that an explicit ablation isolating the context-rich component and additional analysis would strengthen the claims. In the revised manuscript we will add (i) a dedicated ablation in Section 4.3 comparing context-rich vs. isolated features on all five benchmarks, (ii) a brief feature similarity analysis (e.g., cosine similarity to ground-truth relational annotations) in the supplementary material, and (iii) a failure-case breakdown highlighting remaining spatial ambiguities. These additions will be included in the camera-ready version. revision: yes
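For readers wondering what the proposed feature-similarity analysis might look like in practice, here is a hypothetical probe: compare pairwise cosine similarities of object features for annotated relation pairs against unrelated pairs. The function name and data layout are illustrative assumptions, not code from the paper or the rebuttal.

```python
# Hypothetical probe of relational structure in object features, not from the paper:
# a positive gap between "related" and "unrelated" mean similarities would suggest
# the context-rich features carry some signal about ground-truth spatial relations.
import torch
import torch.nn.functional as F

def relation_separation(obj_feats, relation_pairs):
    # obj_feats: (N, d) context-rich object features from the model
    # relation_pairs: set of (i, j) index pairs annotated with a spatial relation
    n = obj_feats.shape[0]
    sims = F.cosine_similarity(obj_feats.unsqueeze(1), obj_feats.unsqueeze(0), dim=-1)  # (N, N)
    related, unrelated = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                (related if (i, j) in relation_pairs else unrelated).append(sims[i, j])
    # assumes both groups are non-empty; returns the mean-similarity gap
    return torch.stack(related).mean() - torch.stack(unrelated).mean()
```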

Circularity Check

0 steps flagged

No circularity in derivation chain; method is empirical and self-contained

full rationale

The paper presents an empirical MLLM architecture that feeds context-rich features from frozen external pre-trained 3D scene-level and 2D image-level encoders into an LLM via identifier tokens. No equations, uniqueness theorems, or predictions are derived; the central performance claims rest on benchmark results rather than any reduction to fitted parameters or self-citations by construction. The brief contrast with prior Chat-Scene work is purely comparative and does not carry the load-bearing argument. The derivation chain is therefore self-contained against external benchmarks and pre-trained models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained 3D and 2D encoders already encode the necessary relational context; no new free parameters, axioms, or invented entities are introduced beyond the choice of existing encoders and the object decomposition step.

pith-pipeline@v0.9.0 · 5598 in / 1173 out tokens · 41791 ms · 2026-05-14T21:27:54.076380+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 13 internal anchors

  1. [1]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” see https://vicuna.lmsys.org (accessed 14 April 2023), 2023

  2. [2]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  4. [4]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022

  5. [5]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi et al., “mplug-owl: Modularization empowers large language models with multimodality,” arXiv preprint arXiv:2304.14178, 2023

  6. [6]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, and Z. Liu, “Otter: A multi-modal model with in-context instruction tuning,” arXiv preprint arXiv:2305.03726, 2023

  7. [7]

    VideoChat: Chat-Centric Video Understanding

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

  8. [8]

    Visual Instruction Tuning

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023

  9. [9]

    Bubogpt: Enabling visual grounding in multi-modal llms,

    Y. Zhao, Z. Lin, D. Zhou, Z. Huang, J. Feng, and B. Kang, “Bubogpt: Enabling visual grounding in multi-modal llms,” arXiv preprint arXiv:2307.08581, 2023

  10. [10]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

  11. [11]

    Improved Baselines with Visual Instruction Tuning

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023

  12. [12]

    Lisa: Reasoning segmentation via large language model,

    X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” arXiv preprint arXiv:2308.00692, 2023

  13. [13]

    Ferret: Refer and ground anything anywhere at any granularity

    H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang, “Ferret: Refer and ground anything anywhere at any granularity,” arXiv preprint arXiv:2310.07704, 2023

  14. [14]

    Scanrefer: 3d object localization in rgb-d scans using natural language,

    D. Z. Chen, A. X. Chang, and M. Nießner, “Scanrefer: 3d object localization in rgb-d scans using natural language,” in European conference on computer vision. Springer, 2020, pp. 202–221

  15. [15]

    Multi3drefer: Grounding text description to multiple 3d objects,

    Y. Zhang, Z. Gong, and A. X. Chang, “Multi3drefer: Grounding text description to multiple 3d objects,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15225–15236

  16. [16]

    Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,

    P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas, “Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 422–440

  17. [17]

    Scan2cap: Context-aware dense captioning in rgb-d scans,

    Z. Chen, A. Gholami, M. Nießner, and A. X. Chang, “Scan2cap: Context-aware dense captioning in rgb-d scans,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3193–3203

  18. [18]

    Scanqa: 3d question answering for spatial scene understanding,

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 19129–19139

  19. [19]

    Sqa3d: Situated question answering in 3d scenes,

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,” arXiv preprint arXiv:2210.07474, 2022

  20. [20]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning,

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning,”arXiv preprint arXiv:2311.18651, 2023

  21. [21]

    An embodied generalist agent in 3d world,

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,” arXiv preprint arXiv:2311.12871, 2023

  22. [22]

    Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,

    Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,” arXiv preprint arXiv:2308.08769, 2023

  23. [23]

    3d-llm: Injecting the 3d world into large language models,

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” arXiv preprint arXiv:2307.12981, 2023

  24. [24]

    Chat-scene: Bridging 3d scene and large language models with object identifiers,

    H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang et al., “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  25. [25]

    Mask3d: Mask transformer for 3d semantic instance segmentation,

    J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Mask3d: Mask transformer for 3d semantic instance segmentation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 8216–8223

  26. [26]

    Uni3d: Exploring unified 3d representation at scale,

    J. Zhou, J. Wang, B. Ma, Y.-S. Liu, T. Huang, and X. Wang, “Uni3d: Exploring unified 3d representation at scale,” arXiv preprint arXiv:2310.06773, 2023

  27. [27]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

  28. [28]

    Llava-o1: Let vision language models reason step-by-step,

    G. Xu, P. Jin, L. Hao, Y. Song, L. Sun, and L. Yuan, “Llava-o1: Let vision language models reason step-by-step,” arXiv preprint arXiv:2411.10440, 2024

  29. [29]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models,

    Y. Dong, Z. Liu, H.-L. Sun, J. Yang, W. Hu, Y. Rao, and Z. Liu, “Insight-v: Exploring long-chain visual reasoning with multimodal large language models,” arXiv preprint arXiv:2411.14432, 2024

  30. [30]

    Openai o1,

    “Openai o1,” https://openai.com/o1/, official webpage. Accessed: 2025-03-16

  31. [31]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025, preprint

  32. [32]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

  33. [33]

    Thinking in space: How multimodal large language models see, remember, and recall spaces,

    J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10632–10643

  34. [34]

    Text-guided graph neural networks for referring 3d instance segmentation,

    P.-H. Huang, H.-H. Lee, H.-T. Chen, and T.-L. Liu, “Text-guided graph neural networks for referring 3d instance segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 2021, pp. 1610–1618

  35. [35]

    Multi-view transformer for 3d visual grounding,

    S. Huang, Y. Chen, J. Jia, and L. Wang, “Multi-view transformer for 3d visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15524–15533

  36. [36]

    Language conditioned spatial relation reasoning for 3d object grounding,

    S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Language conditioned spatial relation reasoning for 3d object grounding,” Advances in Neural Information Processing Systems, vol. 35, pp. 20522–20535, 2022

  37. [37]

    3drp-net: 3d relative position-aware network for 3d visual grounding,

    Z. Wang, H. Huang, Y. Zhao, L. Li, X. Cheng, Y. Zhu, A. Yin, and Z. Zhao, “3drp-net: 3d relative position-aware network for 3d visual grounding,” arXiv preprint arXiv:2307.13363, 2023

  38. [38]

    3dvg-transformer: Relation modeling for visual grounding on point clouds,

    L. Zhao, D. Cai, L. Sheng, and D. Xu, “3dvg-transformer: Relation modeling for visual grounding on point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2928–2937

  39. [39]

    Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding,

    Z. Wang, H. Huang, Y. Zhao, L. Li, X. Cheng, Y. Zhu, A. Yin, and Z. Zhao, “Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2662–2671

  40. [40]

    Three ways to improve verbo-visual fusion for dense 3d visual grounding,

    O. Unal, C. Sakaridis, S. Saha, F. Yu, and L. Van Gool, “Three ways to improve verbo-visual fusion for dense 3d visual grounding,”arXiv preprint arXiv:2309.04561, 2023

  41. [41]

    X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning,

    Z. Yuan, X. Yan, Y. Liao, Y. Guo, G. Li, S. Cui, and Z. Li, “X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8563–8573

  42. [42]

    More: Multi-order relation mining for dense captioning in 3d scenes,

    Y. Jiao, S. Chen, Z. Jie, J. Chen, L. Ma, and Y.-G. Jiang, “More: Multi-order relation mining for dense captioning in 3d scenes,” in European Conference on Computer Vision. Springer, 2022, pp. 528–545

  43. [43]

    End-to-end 3d dense captioning with vote2cap-detr,

    S. Chen, H. Zhu, X. Chen, Y. Lei, G. Yu, and T. Chen, “End-to-end 3d dense captioning with vote2cap-detr,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11124–11133

  44. [44]

    Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning,

    S. Chen, H. Zhu, M. Li, X. Chen, P. Guo, Y. Lei, Y. Gang, T. Li, and T. Chen, “Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  45. [45]

    Clip-guided vision-language pre-training for question answering in 3d scenes,

    M. Parelli, A. Delitzas, N. Hars, G. Vlassis, S. Anagnostidis, G. Bachmann, and T. Hofmann, “Clip-guided vision-language pre-training for question answering in 3d scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5606–5611

  46. [46]

    3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds,

    D. Cai, L. Zhao, J. Zhang, L. Sheng, and D. Xu, “3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16464–16473

  47. [47]

    D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,

    D. Z. Chen, Q. Wu, M. Nießner, and A. X. Chang, “D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,” in European Conference on Computer Vision. Springer, 2022, pp. 487–505

  48. [48]

    3d-vista: Pre-trained transformer for 3d vision and text alignment,

    Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2911–2921

  49. [49]

    Context-aware alignment and mutual masking for 3d-language pre-training,

    Z. Jin, M. Hayat, Y. Yang, Y. Guo, and Y. Lei, “Context-aware alignment and mutual masking for 3d-language pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10984–10994

  50. [50]

    A survey on fine-grained multimodal large language models,

    P. Yuxin, W. Zishuo, L. Geng, Z. Xiangtian, Y . Sibo, and H. Hulingxiao, “A survey on fine-grained multimodal large language models,”Chinese Journal of Electronics, vol. 35, no. 2, pp. 1–33, 2026. [Online]. Available: https://cje.ejournal.org.cn/en/article/doi/10.23919/cje.2025.00.336

  51. [51]

    Imagebind-llm: Multi-modality instruction tuning,

    J. Han, R. Zhang, W. Shao, P. Gao, P. Xu, H. Xiao, K. Zhang, C. Liu, S. Wen, Z. Guo et al., “Imagebind-llm: Multi-modality instruction tuning,” arXiv preprint arXiv:2309.03905, 2023

  52. [52]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,

    Z. Guo, R. Zhang, X. Zhu, Y. Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Li et al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,” arXiv preprint arXiv:2309.00615, 2023

  53. [53]

    Pointllm: Empowering large language models to understand point clouds,

    R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, “Pointllm: Empowering large language models to understand point clouds,” arXiv preprint arXiv:2308.16911, 2023

  54. [54]

    Grounded 3d-llm with referent tokens,

    Y. Chen, S. Yang, H. Huang, T. Wang, R. Lyu, R. Xu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024

  55. [55]

    Scene-llm: Extending language model for 3d visual understanding and reasoning,

    R. Fu, J. Liu, X. Chen, Y. Nie, and W. Xiong, “Scene-llm: Extending language model for 3d visual understanding and reasoning,” arXiv preprint arXiv:2403.11401, 2024

  56. [56]

    Towards reliable multimodal intelligence via uncertainty-aware inference,

    G. Tianxiang, G. Shiqi, S. Qi, S. Qingyun, Z. Haoyi, and L. Jianxin, “Towards reliable multimodal intelligence via uncertainty-aware inference,” Chinese Journal of Electronics, vol. 35, pp. 1–16, 2026. [Online]. Available: https://cje.ejournal.org.cn/en/article/doi/10.23919/cje.2025.00.215

  57. [57]

    Formation-guided multimodal representation learning for group action quality assessment,

    D. Zexing, B. Wei, Z. Yumei, and W. Xiaojun, “Formation-guided multimodal representation learning for group action quality assessment,” Chinese Journal of Electronics, vol. 35, no. 4, pp. 1–11, 2026. [Online]. Available: https://cje.ejournal.org.cn/en/article/doi/10.23919/cje.2025.00.486

  58. [58]

    Point-bert: Pre-training 3d point cloud transformers with masked point modeling,

    X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19313–19322

  59. [59]

    Masked autoencoders for point cloud self-supervised learning,

    Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” in European conference on computer vision. Springer, 2022, pp. 604–621

  60. [60]

    Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training,

    R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, and H. Li, “Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training,” Advances in neural information processing systems, vol. 35, pp. 27061–27074, 2022

  61. [61]

    Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,

    L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1179–1189

  62. [62]

    Openshape: Scaling up 3d shape representation towards open-world understanding,

    M. Liu, R. Shi, K. Kuang, Y. Zhu, X. Li, S. Han, H. Cai, F. Porikli, and H. Su, “Openshape: Scaling up 3d shape representation towards open-world understanding,” arXiv preprint arXiv:2305.10764, 2023

  63. [63]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  64. [64]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

    S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 38–55

  65. [65]

    Grounded language-image pre-training,

    L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10965–10975

  66. [66]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  67. [67]

    Cider: Consensus-based image description evaluation,

    R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575

  68. [68]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318

  69. [69]

    OPT: Open Pre-trained Transformer Language Models

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022

  70. [70]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  71. [71]

    Rio: 3d object instance re-localization in changing indoor environments,

    J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “Rio: 3d object instance re-localization in changing indoor environments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7658–7667

  72. [72]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,

    T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang, “Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4542–4550

  73. [73]

    Lidar-llm: Exploring the potential of large language models for 3d lidar understanding,

    S. Yang, J. Liu, R. Zhang, M. Pan, Z. Guo, X. Li, Z. Chen, P. Gao, H. Li, Y. Guo et al., “Lidar-llm: Exploring the potential of large language models for 3d lidar understanding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9247–9255

  74. [74]

    Occllama: An occupancy-language-action generative world model for autonomous driving,

    J. Wei, S. Yuan, P. Li, Q. Hu, Z. Gan, and W. Ding, “Occllama: An occupancy-language-action generative world model for autonomous driving,” arXiv preprint arXiv:2409.03272, 2024

  75. [75]

    Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences,

    H. Zhi, P. Chen, J. Li, S. Ma, X. Sun, T. Xiang, Y. Lei, M. Tan, and C. Gan, “Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3761–3771

  76. [76]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,

    X. Zhou, X. Han, F. Yang, Y. Ma, V. Tresp, and A. Knoll, “Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,” arXiv preprint arXiv:2503.23463, 2025

  77. [77]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue et al., “Llama-adapter v2: Parameter-efficient visual instruction model,” arXiv preprint arXiv:2304.15010, 2023

  78. [78]

    Tracking anything with decoupled video segmentation,

    H. K. Cheng, S. W. Oh, B. Price, A. Schwing, and J.-Y. Lee, “Tracking anything with decoupled video segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1316–1326