PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Liujuan Cao; Shaohui Dai; Shengchuan Zhang; Yansong Qu; You Shen

arxiv: 2606.06485 · v1 · pith:SFHG6S2Wnew · submitted 2026-06-04 · 💻 cs.CV

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai , Yansong Qu , You Shen , Shengchuan Zhang , Liujuan Cao This is my paper

Pith reviewed 2026-06-28 02:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene understandingpart-aware representation3D-MLLMreferring segmentationvisual question answeringsynthetic datasethierarchical queriesmultimodal large language model

0 comments

The pith

PAR3D adds part-level awareness to 3D-MLLMs so models can ground both objects and their parts in scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PAR3D as a unified 3D multimodal large language model that extends current object-centric systems to also model fine-grained part structures essential for embodied interaction. It supports this with a new synthetic dataset ScenePart containing part-level annotations and language instructions, plus two technical additions: Part-Aware 3D Representation Learning to enrich visual features with part semantics and Hierarchical Segmentation Query Generation to produce hierarchical object-part queries for grounding. Experiments indicate clear gains on part-level question answering and referring segmentation while object-level vision-language performance stays strong. A sympathetic reader would care because many real-world 3D tasks, from robotic grasping to scene navigation, require distinguishing and referring to specific parts rather than whole objects alone.

Core claim

PAR3D is a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. Training and evaluation are supported by the introduced ScenePart synthetic dataset with part-level annotations and language instructions. Part-Aware 3D Representation Learning enriches 3D visual representations with fine-grained part-level semantics, while Hierarchical Segmentation Query Generation grounds part targets via hierarchical object-part queries. The result is substantial improvement on part-level question answering and referring segmentation alongside strong object-level performance.

What carries the argument

Part-Aware 3D Representation Learning that enriches 3D visual representations with fine-grained part-level semantics, paired with Hierarchical Segmentation Query Generation that produces hierarchical object-part queries for grounding.

If this is right

Models gain the ability to answer part-level questions and perform part-level referring segmentation in 3D scenes.
Object-level vision-language tasks such as captioning and VQA continue to perform strongly.
A single framework handles both object-centric and part-centric 3D scene understanding tasks.
Hierarchical queries allow grounding of parts relative to their parent objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robotic systems could use the part-aware outputs for more precise manipulation of object components.
The synthetic-to-real transfer approach may extend to other 3D perception tasks where part annotations are scarce.
Hierarchical query mechanisms could be adapted for multi-scale reasoning in other multimodal settings.

Load-bearing premise

Training on the synthetic ScenePart dataset will transfer to enable part-aware understanding in real 3D scenes.

What would settle it

Running the trained PAR3D model on real-world 3D scene benchmarks and finding no measurable gain in part-level question answering or referring segmentation accuracy compared with prior object-centric 3D-MLLMs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.06485 by Liujuan Cao, Shaohui Dai, Shengchuan Zhang, Yansong Qu, You Shen.

**Figure 2.** Figure 2: ScenePart Data Construction Pipeline. ScenePart composes part-annotated 3D objects into synthesized indoor layouts, producing object- and part-level mask annotations in 3D scenes and multi-task language instructions for training and evaluating part-aware 3D-MLLMs. where Pool(·) denotes superpoint pooling, and Fe consists of M encoder features {f e i }M i=1 over superpoints. A query decoder D further refine… view at source ↗

**Figure 3.** Figure 3: Overall Framework of PAR3D. PAR3D is trained with a two-stage scheme. Stage 1 adapts the 3D visual backbone with object- and part-level supervision through instance segmentation, part-aware contrastive learning, and representation-preserving regularization. Stage 2 performs instruction tuning on the MLLM using 3D vision-language instruction data. PAR3D generates textual responses and object or part masks t… view at source ↗

**Figure 4.** Figure 4: Qualitative Examples of PAR3D. We present examples on (a) referring segmentation and (b) visual question answering for part-aware 3D scene understanding. 4.4 Ablation Studies We conduct ablation studies to analyze the contribution of each component in our framework. Since our main designs focus on visual representation learning and hierarchical segmentation query generation, we report mIoU on ScenePart-Seg… view at source ↗

**Figure 5.** Figure 5: shows additional question answering examples from ScanQA and ScenePart-QA datasets. Each example includes the input scene, the ground-truth answer, the answer predicted by PAR3D, and the answer predicted by 3D-LLaVA. These cases illustrate that PAR3D can answer questions that require understanding object attributes, object parts, spatial relationships, and scene-level context. Compared to 3D-LLaVA, PAR3D p… view at source ↗

**Figure 6.** Figure 6: Additional Qualitative Comparisons on Referring Segmentation. Each example includes the input scene, the ground-truth mask, the prediction of PAR3D, and the prediction of 3D-LLaVA. The target mask is highlighted in blue, regardless of whether the target corresponds to an object or a part. PAR3D achieves more accurate segmentation across representative examples from multiple datasets. 18 [PITH_FULL_IMAGE:f… view at source ↗

read the original abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PAR3D adds part-level handling to 3D MLLMs with a new synthetic dataset and two techniques, but the transfer to real scenes is unshown and remains the key untested step.

read the letter

The main takeaway is that this work targets the object-centric limit in current 3D-MLLMs by introducing part-aware representations and queries, backed by a new synthetic dataset. That is the actual addition.

What stands out as new is the PAR3D framework itself, the ScenePart dataset with part-level annotations and instructions, the Part-Aware 3D Representation Learning approach, and the Hierarchical Segmentation Query Generation method. These are presented as direct responses to gaps in prior models for fine-grained scene understanding.

The paper does a clear job stating the motivation: embodied tasks need part-level reasoning, not just whole objects. The two proposed techniques look like practical extensions of existing segmentation and query mechanisms.

The soft spot is exactly the one flagged in the stress test. The abstract and claims rest on ScenePart for both training and evaluation, with no mention of real 3D datasets such as ScanNet or 3RScan, no domain gap measurements, and no sim-to-real results. If the part annotations and language instructions do not carry over, the reported gains on part-level QA and referring segmentation stay confined to the synthetic case and do not support the embodied interaction story.

This paper is aimed at groups already working on 3D vision-language models. A reader who wants to see concrete ablations and real-scene numbers would get value from the full experiments. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject, mainly to check whether the experiments close the transfer gap or leave it open.

Referee Report

1 major / 0 minor

Summary. The paper introduces PAR3D, a unified part-aware 3D-MLLM, along with the synthetic ScenePart dataset containing part-level annotations and language instructions. It proposes Part-Aware 3D Representation Learning to enrich 3D visual features with fine-grained part semantics and Hierarchical Segmentation Query Generation to produce hierarchical object-part queries for grounding. The central claim is that these components enable models to understand, reason about, and ground both objects and parts, yielding substantial gains on part-level VQA and referring segmentation while preserving strong object-level performance.

Significance. If the synthetic-to-real transfer holds, the work would meaningfully extend 3D-MLLMs beyond object-centric limitations and support finer-grained embodied interaction. The creation of ScenePart is a constructive step toward part-level supervision, but the overall significance hinges on whether the reported gains generalize outside the synthetic distribution.

major comments (1)

[Abstract] Abstract: The claim that PAR3D enables part-aware understanding and grounding 'in 3D scenes' for embodied interaction is load-bearing on sim-to-real transfer, yet the abstract introduces ScenePart solely for training/evaluation and reports improvements without any reference to real-world 3D datasets (ScanNet, 3RScan, etc.) or domain-randomization metrics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for precision in the abstract regarding the scope of our claims. We agree that the current wording could overstate generalization and will revise accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that PAR3D enables part-aware understanding and grounding 'in 3D scenes' for embodied interaction is load-bearing on sim-to-real transfer, yet the abstract introduces ScenePart solely for training/evaluation and reports improvements without any reference to real-world 3D datasets (ScanNet, 3RScan, etc.) or domain-randomization metrics.

Authors: We agree the abstract phrasing is imprecise. The manuscript evaluates exclusively on the synthetic ScenePart dataset; no experiments on ScanNet, 3RScan or other real-world 3D datasets are reported, and no domain-randomization or sim-to-real metrics are provided. The framework is motivated by embodied interaction needs, but we do not claim or demonstrate transfer. We will revise the abstract to state that gains are shown on synthetic part-level data and note the absence of real-world validation as a limitation, with sim-to-real transfer left for future work. revision: yes

Circularity Check

0 steps flagged

No circularity; standard empirical proposal with independent dataset and modules

full rationale

The paper introduces a new synthetic dataset (ScenePart) and two new modules (Part-Aware 3D Representation Learning and Hierarchical Segmentation Query Generation) as contributions, then reports empirical gains on part-level and object-level tasks. No equations, derivations, or claims reduce a 'prediction' or result to a fitted input by construction, nor does any load-bearing premise rest on a self-citation chain or imported uniqueness theorem. The central claims rest on external evaluation metrics rather than self-referential definitions, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no specific free parameters, axioms, or invented entities beyond the high-level contributions can be identified or verified.

invented entities (2)

PAR3D no independent evidence
purpose: Unified part-aware 3D-MLLM framework
New model name and architecture introduced in abstract.
ScenePart no independent evidence
purpose: Synthetic 3D scene dataset with part-level annotations
New dataset introduced in abstract.

pith-pipeline@v0.9.1-grok · 5745 in / 1148 out tokens · 28159 ms · 2026-06-28T02:15:29.159157+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 20 canonical work pages · 4 internal anchors

[1]

Achlioptas, A

P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020

2020
[2]

Azuma, T

D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

2022
[3]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Banerjee and A

S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

2005
[5]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020

2020
[8]

S. Chen, H. Zhu, X. Chen, Y . Lei, G. Yu, and T. Chen. End-to-end 3d dense captioning with vote2cap- detr. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11124–11133, 2023. 10

2023
[9]

S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024

2024
[10]

Y . Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang. Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370, 2024

work page arXiv 2024
[11]

Z. Chen, A. Gholami, M. Nießner, and A. X. Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203, 2021

2021
[12]

Z. Chen, R. Hu, X. Chen, M. Nießner, and A. X. Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding. InProceedings of the IEEE/CVF international conference on computer vision, pages 18109–18119, 2023

2023
[13]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017
[14]

S. Dai, Y . Qu, Z. Li, X. Li, S. Zhang, and L. Cao. Training-free hierarchical scene understanding for gaussian splatting with superpoint graphs. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3673–3682, 2025

2025
[15]

J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3772–3782, 2025

2025
[16]

H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021

2021
[17]

H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

2021
[18]

R. Fu, J. Liu, X. Chen, Y . Nie, and W. Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

work page arXiv 2024
[19]

H. Geng, H. Xu, C. Zhao, C. Xu, L. Yi, S. Huang, and H. Wang. Gapartnet: Cross-category domain- generalizable object perception and manipulation via generalizable and actionable parts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7081–7091, 2023

2023
[20]

S. He, H. Ding, X. Jiang, and B. Wen. Segpoint: Segment any point cloud via large language model. In European Conference on Computer Vision, pages 349–367. Springer, 2024

2024
[21]

Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

2023
[22]

S. Hu, D. M. Arroyo, S. Debats, F. Manhardt, L. Carlone, and F. Tombari. Mixed diffusion for 3d indoor scene synthesis. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1262–1272, 2026

2026
[23]

Huang, Y

H. Huang, Y . Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y . Zhao, J. Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

2024
[24]

An Embodied Generalist Agent in 3D World

J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y . Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Huang, X

J. Huang, X. Ma, X. Linghu, Y . Fan, J. He, W. Tan, Q. Li, S.-C. Zhu, Y . Chen, B. Jia, et al. Leo-vl: Efficient scene representation for scalable 3d vision-language learning.arXiv preprint arXiv:2506.09935, 2025

work page arXiv 2025
[26]

Huang, X

K.-C. Huang, X. Li, L. Qi, S. Yan, and M.-H. Yang. Reason3d: Searching and reasoning 3d segmentation via large language model. In2025 International Conference on 3D Vision (3DV), pages 1177–1186. IEEE, 2025

2025
[27]

J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik. LERF: language embedded radiance fields. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 19672–19682. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01807. URL https: //doi.org/10.1109/ICCV51070.2023.01807. 11

work page doi:10.1109/iccv51070.2023.01807 2023
[28]

Y . Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018

2018
[29]

Y . Li, U. Upadhyay, H. Slim, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, and M. Elhoseiny. 3d compat: Composition of materials on parts of 3d things. InEuropean conference on computer vision, pages 110–127. Springer, 2022

2022
[30]

C.-Y . Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

2004
[31]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

2024
[32]

M. Liu, Y . Zhu, H. Cai, S. Han, Z. Ling, F. Porikli, and H. Su. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21736–21746, 2023

2023
[33]

C. Ma, Y . Li, X. Yan, J. Xu, Y . Yang, C. Wang, Z. Zhao, Y . Guo, Z. Chen, and C. Guo. P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

work page arXiv 2025
[34]

X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

work page arXiv 2022
[35]

X. Ma, B. Smart, Y . Bhalgat, S. Chen, X. Li, J. Ding, J. Gu, D. Z. Chen, S. Peng, J.-W. Bian, et al. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models. arXiv preprint arXiv:2405.10255, 2024

work page arXiv 2024
[36]

Z. Ma, Y . Yue, and G. Gkioxari. Find any part in 3d. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7818–7827, 2025

2025
[37]

K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019

2019
[38]

K. Mo, L. J. Guibas, M. Mukadam, A. Gupta, and S. Tulsiani. Where2act: From pixels to actions for articulated 3d objects. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6813–6823, 2021

2021
[39]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

2002
[40]

A. V . Phan, M. Le Nguyen, Y . L. H. Nguyen, and L. T. Bui. Dgcnn: A convolutional neural network over large-scale labeled graphs.Neural Networks, 108:533–543, 2018

2018
[41]

C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

2017
[42]

Z. Qi, Y . Fang, Z. Sun, X. Wu, T. Wu, J. Wang, D. Lin, and H. Zhao. Gpt4point: A unified framework for point-language understanding and generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26417–26427, 2024

2024
[43]

M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. Langsplat: 3d language gaussian splatting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 20051–20060. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01895. URL https://doi.org/10.1109/CVPR52733.2024.01895

work page doi:10.1109/cvpr52733.2024.01895 2024
[44]

Y . Qu, Y . Wang, and Y . Qi. Sg-nerf: Semantic-guided point-based neural radiance fields. In2023 IEEE International Conference on Multimedia and Expo (ICME), pages 570–575. IEEE, 2023

2023
[45]

Y . Qu, S. Dai, X. Li, J. Lin, L. Cao, S. Zhang, and R. Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. InProceedings of the 32nd ACM International Conference on Multimedia, pages 5328–5337, 2024

2024
[46]

Y . Qu, D. Chen, X. Li, X. Li, S. Zhang, L. Cao, and R. Ji. Drag your gaussian: Effective drag-based editing with score distillation for 3d gaussian splatting.ArXiv preprint, abs/2501.18672, 2025. URL https://arxiv.org/abs/2501.18672. 12

work page arXiv 2025
[47]

H. Slim, X. Li, Y . Li, M. Ahmed, M. Ayman, U. Upadhyay, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, et al. 3dcompat++: An improved large-scale 3d vision dataset for compositional recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[48]

Y . Tang, X. Han, X. Li, Q. Yu, Y . Hao, L. Hu, and M. Chen. Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6617–6626, 2024

2024
[49]

Umam, C.-K

A. Umam, C.-K. Yang, M.-H. Chen, J.-H. Chuang, and Y .-Y . Lin. Partdistill: 3d shape part segmentation by vision-language model distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3470–3479, 2024

2024
[50]

Vedantam, C

R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

2015
[51]

J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner. Rio: 3d object instance re-localization in changing indoor environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019

2019
[52]

C. Wang, J. Ye, Y . Yang, Y . Li, Z. Lin, J. Zhu, Z. Chen, Y . Luo, and C. Guo. Part-x-mllm: Part-aware 3d multimodal large language model.arXiv preprint arXiv:2511.13647, 2025

work page arXiv 2025
[53]

J. Wang, D. Wang, J. Hu, Q. Zhang, J. Yu, and L. Xu. Kinematify: Open-vocabulary synthesis of high-dof articulated objects.arXiv preprint arXiv:2511.01294, 2025

work page arXiv 2025
[54]

Y . Wang, J. Wang, Y . Qu, and Y . Qi. Rip-nerf: learning rotation-invariant point-based neural radiance field for fine-grained editing and compositing. InProceedings of the 2023 ACM International Conference on Multimedia Retrieval, pages 125–134, 2023

2023
[55]

Y . Wang, J. Wang, R. Gao, Y . Qu, W. Duan, S. Yang, and Y . Qi. Look at the sky: Sky-aware efficient 3d gaussian splatting in the wild.IEEE Transactions on Visualization and Computer Graphics, 2025

2025
[56]

C. Wu, Y . Ma, Q. Chen, H. Wang, G. Luo, J. Ji, and X. Sun. 3d-stmn: Dependency-driven superpoint-text matching network for end-to-end 3d referring expression segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5940–5948, 2024

2024
[57]

X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao. Point transformer v3: Simpler faster stronger. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024

2024
[58]

X. Wu, D. DeTone, D. Frost, T. Shen, C. Xie, N. Yang, J. Engel, R. Newcombe, H. Zhao, and J. Straub. Sonata: Self-supervised learning of reliable point representations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22193–22204, 2025

2025
[59]

Xiang, Y

F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

2020
[60]

R. Xu, X. Wang, T. Wang, Y . Chen, J. Pang, and D. Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer, 2024

2024
[61]

Y . Yang, Y . Huang, Y .-C. Guo, L. Lu, X. Wu, E. Y . Lam, Y .-P. Cao, and X. Liu. Sampart3d: Segment any part in 3d objects.arXiv preprint arXiv:2411.07184, 2024

work page arXiv 2024
[62]

L. Yi, V . G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections.ACM Transactions on Graphics (ToG), 35(6):1–12, 2016

2016
[63]

Zemskova and D

T. Zemskova and D. Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8885–8895, 2025

2025
[64]

J. Zha, Y . Fan, X. Yang, C. Gao, and X. Chen. How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

work page arXiv 2025
[65]

Zhang, Z

Y . Zhang, Z. Gong, and A. X. Chang. Multi3drefer: Grounding text description to multiple 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15225–15236, 2023. 13

2023
[66]

arXiv preprint arXiv:2510.23607 (2025)

Y . Zhang, X. Wu, Y . Lao, C. Wang, Z. Tian, N. Wang, and H. Zhao. Concerto: Joint 2d-3d self-supervised learning emerges spatial representations.arXiv preprint arXiv:2510.23607, 2025

work page arXiv 2025
[67]

Zhang, X

Y . Zhang, X. Wu, Y . Yang, X. Fan, H. Li, Y . Zhang, Z. Huang, N. Wang, and H. Zhao. Utonia: Toward one encoder for all point clouds.arXiv preprint arXiv:2603.03283, 2026

work page arXiv 2026
[68]

H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V . Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021

2021
[69]

Zheng, S

D. Zheng, S. Huang, and L. Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8995–9006, 2025

2025
[70]

Y . Zhou, J. Gu, T. Y . Chiang, F. Xiang, and H. Su. Point-sam: Promptable 3d segmentation model for point clouds.arXiv preprint arXiv:2406.17741, 2024

work page arXiv 2024
[71]

C. Zhu, T. Wang, W. Zhang, K. Chen, and X. Liu. Scanreason: Empowering 3d visual grounding with reasoning capabilities. InEuropean Conference on Computer Vision, pages 151–168. Springer, 2024

2024
[72]

C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025

2025
[73]

Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023. 14 A Evaluation Metrics We evaluate PAR3D with existing approaches on referring segmentation, visual question answering, and dense capti...

2023

[1] [1]

Achlioptas, A

P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020

2020

[2] [2]

Azuma, T

D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

2022

[3] [3]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Banerjee and A

S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

2005

[5] [5]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020

2020

[8] [8]

S. Chen, H. Zhu, X. Chen, Y . Lei, G. Yu, and T. Chen. End-to-end 3d dense captioning with vote2cap- detr. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11124–11133, 2023. 10

2023

[9] [9]

S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024

2024

[10] [10]

Y . Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang. Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370, 2024

work page arXiv 2024

[11] [11]

Z. Chen, A. Gholami, M. Nießner, and A. X. Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203, 2021

2021

[12] [12]

Z. Chen, R. Hu, X. Chen, M. Nießner, and A. X. Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding. InProceedings of the IEEE/CVF international conference on computer vision, pages 18109–18119, 2023

2023

[13] [13]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017

[14] [14]

S. Dai, Y . Qu, Z. Li, X. Li, S. Zhang, and L. Cao. Training-free hierarchical scene understanding for gaussian splatting with superpoint graphs. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3673–3682, 2025

2025

[15] [15]

J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3772–3782, 2025

2025

[16] [16]

H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021

2021

[17] [17]

H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

2021

[18] [18]

R. Fu, J. Liu, X. Chen, Y . Nie, and W. Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

work page arXiv 2024

[19] [19]

H. Geng, H. Xu, C. Zhao, C. Xu, L. Yi, S. Huang, and H. Wang. Gapartnet: Cross-category domain- generalizable object perception and manipulation via generalizable and actionable parts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7081–7091, 2023

2023

[20] [20]

S. He, H. Ding, X. Jiang, and B. Wen. Segpoint: Segment any point cloud via large language model. In European Conference on Computer Vision, pages 349–367. Springer, 2024

2024

[21] [21]

Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

2023

[22] [22]

S. Hu, D. M. Arroyo, S. Debats, F. Manhardt, L. Carlone, and F. Tombari. Mixed diffusion for 3d indoor scene synthesis. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1262–1272, 2026

2026

[23] [23]

Huang, Y

H. Huang, Y . Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y . Zhao, J. Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

2024

[24] [24]

An Embodied Generalist Agent in 3D World

J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y . Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Huang, X

J. Huang, X. Ma, X. Linghu, Y . Fan, J. He, W. Tan, Q. Li, S.-C. Zhu, Y . Chen, B. Jia, et al. Leo-vl: Efficient scene representation for scalable 3d vision-language learning.arXiv preprint arXiv:2506.09935, 2025

work page arXiv 2025

[26] [26]

Huang, X

K.-C. Huang, X. Li, L. Qi, S. Yan, and M.-H. Yang. Reason3d: Searching and reasoning 3d segmentation via large language model. In2025 International Conference on 3D Vision (3DV), pages 1177–1186. IEEE, 2025

2025

[27] [27]

J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik. LERF: language embedded radiance fields. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 19672–19682. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01807. URL https: //doi.org/10.1109/ICCV51070.2023.01807. 11

work page doi:10.1109/iccv51070.2023.01807 2023

[28] [28]

Y . Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018

2018

[29] [29]

Y . Li, U. Upadhyay, H. Slim, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, and M. Elhoseiny. 3d compat: Composition of materials on parts of 3d things. InEuropean conference on computer vision, pages 110–127. Springer, 2022

2022

[30] [30]

C.-Y . Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

2004

[31] [31]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

2024

[32] [32]

M. Liu, Y . Zhu, H. Cai, S. Han, Z. Ling, F. Porikli, and H. Su. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21736–21746, 2023

2023

[33] [33]

C. Ma, Y . Li, X. Yan, J. Xu, Y . Yang, C. Wang, Z. Zhao, Y . Guo, Z. Chen, and C. Guo. P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

work page arXiv 2025

[34] [34]

X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

work page arXiv 2022

[35] [35]

X. Ma, B. Smart, Y . Bhalgat, S. Chen, X. Li, J. Ding, J. Gu, D. Z. Chen, S. Peng, J.-W. Bian, et al. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models. arXiv preprint arXiv:2405.10255, 2024

work page arXiv 2024

[36] [36]

Z. Ma, Y . Yue, and G. Gkioxari. Find any part in 3d. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7818–7827, 2025

2025

[37] [37]

K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019

2019

[38] [38]

K. Mo, L. J. Guibas, M. Mukadam, A. Gupta, and S. Tulsiani. Where2act: From pixels to actions for articulated 3d objects. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6813–6823, 2021

2021

[39] [39]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

2002

[40] [40]

A. V . Phan, M. Le Nguyen, Y . L. H. Nguyen, and L. T. Bui. Dgcnn: A convolutional neural network over large-scale labeled graphs.Neural Networks, 108:533–543, 2018

2018

[41] [41]

C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

2017

[42] [42]

Z. Qi, Y . Fang, Z. Sun, X. Wu, T. Wu, J. Wang, D. Lin, and H. Zhao. Gpt4point: A unified framework for point-language understanding and generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26417–26427, 2024

2024

[43] [43]

M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. Langsplat: 3d language gaussian splatting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 20051–20060. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01895. URL https://doi.org/10.1109/CVPR52733.2024.01895

work page doi:10.1109/cvpr52733.2024.01895 2024

[44] [44]

Y . Qu, Y . Wang, and Y . Qi. Sg-nerf: Semantic-guided point-based neural radiance fields. In2023 IEEE International Conference on Multimedia and Expo (ICME), pages 570–575. IEEE, 2023

2023

[45] [45]

Y . Qu, S. Dai, X. Li, J. Lin, L. Cao, S. Zhang, and R. Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. InProceedings of the 32nd ACM International Conference on Multimedia, pages 5328–5337, 2024

2024

[46] [46]

Y . Qu, D. Chen, X. Li, X. Li, S. Zhang, L. Cao, and R. Ji. Drag your gaussian: Effective drag-based editing with score distillation for 3d gaussian splatting.ArXiv preprint, abs/2501.18672, 2025. URL https://arxiv.org/abs/2501.18672. 12

work page arXiv 2025

[47] [47]

H. Slim, X. Li, Y . Li, M. Ahmed, M. Ayman, U. Upadhyay, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, et al. 3dcompat++: An improved large-scale 3d vision dataset for compositional recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[48] [48]

Y . Tang, X. Han, X. Li, Q. Yu, Y . Hao, L. Hu, and M. Chen. Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6617–6626, 2024

2024

[49] [49]

Umam, C.-K

A. Umam, C.-K. Yang, M.-H. Chen, J.-H. Chuang, and Y .-Y . Lin. Partdistill: 3d shape part segmentation by vision-language model distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3470–3479, 2024

2024

[50] [50]

Vedantam, C

R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

2015

[51] [51]

J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner. Rio: 3d object instance re-localization in changing indoor environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019

2019

[52] [52]

C. Wang, J. Ye, Y . Yang, Y . Li, Z. Lin, J. Zhu, Z. Chen, Y . Luo, and C. Guo. Part-x-mllm: Part-aware 3d multimodal large language model.arXiv preprint arXiv:2511.13647, 2025

work page arXiv 2025

[53] [53]

J. Wang, D. Wang, J. Hu, Q. Zhang, J. Yu, and L. Xu. Kinematify: Open-vocabulary synthesis of high-dof articulated objects.arXiv preprint arXiv:2511.01294, 2025

work page arXiv 2025

[54] [54]

Y . Wang, J. Wang, Y . Qu, and Y . Qi. Rip-nerf: learning rotation-invariant point-based neural radiance field for fine-grained editing and compositing. InProceedings of the 2023 ACM International Conference on Multimedia Retrieval, pages 125–134, 2023

2023

[55] [55]

Y . Wang, J. Wang, R. Gao, Y . Qu, W. Duan, S. Yang, and Y . Qi. Look at the sky: Sky-aware efficient 3d gaussian splatting in the wild.IEEE Transactions on Visualization and Computer Graphics, 2025

2025

[56] [56]

C. Wu, Y . Ma, Q. Chen, H. Wang, G. Luo, J. Ji, and X. Sun. 3d-stmn: Dependency-driven superpoint-text matching network for end-to-end 3d referring expression segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5940–5948, 2024

2024

[57] [57]

X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao. Point transformer v3: Simpler faster stronger. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024

2024

[58] [58]

X. Wu, D. DeTone, D. Frost, T. Shen, C. Xie, N. Yang, J. Engel, R. Newcombe, H. Zhao, and J. Straub. Sonata: Self-supervised learning of reliable point representations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22193–22204, 2025

2025

[59] [59]

Xiang, Y

F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

2020

[60] [60]

R. Xu, X. Wang, T. Wang, Y . Chen, J. Pang, and D. Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer, 2024

2024

[61] [61]

Y . Yang, Y . Huang, Y .-C. Guo, L. Lu, X. Wu, E. Y . Lam, Y .-P. Cao, and X. Liu. Sampart3d: Segment any part in 3d objects.arXiv preprint arXiv:2411.07184, 2024

work page arXiv 2024

[62] [62]

L. Yi, V . G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections.ACM Transactions on Graphics (ToG), 35(6):1–12, 2016

2016

[63] [63]

Zemskova and D

T. Zemskova and D. Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8885–8895, 2025

2025

[64] [64]

J. Zha, Y . Fan, X. Yang, C. Gao, and X. Chen. How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

work page arXiv 2025

[65] [65]

Zhang, Z

Y . Zhang, Z. Gong, and A. X. Chang. Multi3drefer: Grounding text description to multiple 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15225–15236, 2023. 13

2023

[66] [66]

arXiv preprint arXiv:2510.23607 (2025)

Y . Zhang, X. Wu, Y . Lao, C. Wang, Z. Tian, N. Wang, and H. Zhao. Concerto: Joint 2d-3d self-supervised learning emerges spatial representations.arXiv preprint arXiv:2510.23607, 2025

work page arXiv 2025

[67] [67]

Zhang, X

Y . Zhang, X. Wu, Y . Yang, X. Fan, H. Li, Y . Zhang, Z. Huang, N. Wang, and H. Zhao. Utonia: Toward one encoder for all point clouds.arXiv preprint arXiv:2603.03283, 2026

work page arXiv 2026

[68] [68]

H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V . Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021

2021

[69] [69]

Zheng, S

D. Zheng, S. Huang, and L. Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8995–9006, 2025

2025

[70] [70]

Y . Zhou, J. Gu, T. Y . Chiang, F. Xiang, and H. Su. Point-sam: Promptable 3d segmentation model for point clouds.arXiv preprint arXiv:2406.17741, 2024

work page arXiv 2024

[71] [71]

C. Zhu, T. Wang, W. Zhang, K. Chen, and X. Liu. Scanreason: Empowering 3d visual grounding with reasoning capabilities. InEuropean Conference on Computer Vision, pages 151–168. Springer, 2024

2024

[72] [72]

C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025

2025

[73] [73]

Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023. 14 A Evaluation Metrics We evaluate PAR3D with existing approaches on referring segmentation, visual question answering, and dense capti...

2023