pith. sign in

arxiv: 2606.06485 · v1 · pith:SFHG6S2Wnew · submitted 2026-06-04 · 💻 cs.CV

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Pith reviewed 2026-06-28 02:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene understandingpart-aware representation3D-MLLMreferring segmentationvisual question answeringsynthetic datasethierarchical queriesmultimodal large language model
0
0 comments X

The pith

PAR3D adds part-level awareness to 3D-MLLMs so models can ground both objects and their parts in scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PAR3D as a unified 3D multimodal large language model that extends current object-centric systems to also model fine-grained part structures essential for embodied interaction. It supports this with a new synthetic dataset ScenePart containing part-level annotations and language instructions, plus two technical additions: Part-Aware 3D Representation Learning to enrich visual features with part semantics and Hierarchical Segmentation Query Generation to produce hierarchical object-part queries for grounding. Experiments indicate clear gains on part-level question answering and referring segmentation while object-level vision-language performance stays strong. A sympathetic reader would care because many real-world 3D tasks, from robotic grasping to scene navigation, require distinguishing and referring to specific parts rather than whole objects alone.

Core claim

PAR3D is a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. Training and evaluation are supported by the introduced ScenePart synthetic dataset with part-level annotations and language instructions. Part-Aware 3D Representation Learning enriches 3D visual representations with fine-grained part-level semantics, while Hierarchical Segmentation Query Generation grounds part targets via hierarchical object-part queries. The result is substantial improvement on part-level question answering and referring segmentation alongside strong object-level performance.

What carries the argument

Part-Aware 3D Representation Learning that enriches 3D visual representations with fine-grained part-level semantics, paired with Hierarchical Segmentation Query Generation that produces hierarchical object-part queries for grounding.

If this is right

  • Models gain the ability to answer part-level questions and perform part-level referring segmentation in 3D scenes.
  • Object-level vision-language tasks such as captioning and VQA continue to perform strongly.
  • A single framework handles both object-centric and part-centric 3D scene understanding tasks.
  • Hierarchical queries allow grounding of parts relative to their parent objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems could use the part-aware outputs for more precise manipulation of object components.
  • The synthetic-to-real transfer approach may extend to other 3D perception tasks where part annotations are scarce.
  • Hierarchical query mechanisms could be adapted for multi-scale reasoning in other multimodal settings.

Load-bearing premise

Training on the synthetic ScenePart dataset will transfer to enable part-aware understanding in real 3D scenes.

What would settle it

Running the trained PAR3D model on real-world 3D scene benchmarks and finding no measurable gain in part-level question answering or referring segmentation accuracy compared with prior object-centric 3D-MLLMs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.06485 by Liujuan Cao, Shaohui Dai, Shengchuan Zhang, Yansong Qu, You Shen.

Figure 1
Figure 1. Figure 1: We propose PAR3D, a unified 3D-MLLM with part-aware representation, together with [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: ScenePart Data Construction Pipeline. ScenePart composes part-annotated 3D objects into synthesized indoor layouts, producing object- and part-level mask annotations in 3D scenes and multi-task language instructions for training and evaluating part-aware 3D-MLLMs. where Pool(·) denotes superpoint pooling, and Fe consists of M encoder features {f e i }M i=1 over superpoints. A query decoder D further refine… view at source ↗
Figure 3
Figure 3. Figure 3: Overall Framework of PAR3D. PAR3D is trained with a two-stage scheme. Stage 1 adapts the 3D visual backbone with object- and part-level supervision through instance segmentation, part-aware contrastive learning, and representation-preserving regularization. Stage 2 performs instruction tuning on the MLLM using 3D vision-language instruction data. PAR3D generates textual responses and object or part masks t… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative Examples of PAR3D. We present examples on (a) referring segmentation and (b) visual question answering for part-aware 3D scene understanding. 4.4 Ablation Studies We conduct ablation studies to analyze the contribution of each component in our framework. Since our main designs focus on visual representation learning and hierarchical segmentation query generation, we report mIoU on ScenePart-Seg… view at source ↗
Figure 5
Figure 5. Figure 5: shows additional question answering examples from ScanQA and ScenePart-QA datasets. Each example includes the input scene, the ground-truth answer, the answer predicted by PAR3D, and the answer predicted by 3D-LLaVA. These cases illustrate that PAR3D can answer questions that require understanding object attributes, object parts, spatial relationships, and scene-level context. Compared to 3D-LLaVA, PAR3D p… view at source ↗
Figure 6
Figure 6. Figure 6: Additional Qualitative Comparisons on Referring Segmentation. Each example includes the input scene, the ground-truth mask, the prediction of PAR3D, and the prediction of 3D-LLaVA. The target mask is highlighted in blue, regardless of whether the target corresponds to an object or a part. PAR3D achieves more accurate segmentation across representative examples from multiple datasets. 18 [PITH_FULL_IMAGE:f… view at source ↗
read the original abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces PAR3D, a unified part-aware 3D-MLLM, along with the synthetic ScenePart dataset containing part-level annotations and language instructions. It proposes Part-Aware 3D Representation Learning to enrich 3D visual features with fine-grained part semantics and Hierarchical Segmentation Query Generation to produce hierarchical object-part queries for grounding. The central claim is that these components enable models to understand, reason about, and ground both objects and parts, yielding substantial gains on part-level VQA and referring segmentation while preserving strong object-level performance.

Significance. If the synthetic-to-real transfer holds, the work would meaningfully extend 3D-MLLMs beyond object-centric limitations and support finer-grained embodied interaction. The creation of ScenePart is a constructive step toward part-level supervision, but the overall significance hinges on whether the reported gains generalize outside the synthetic distribution.

major comments (1)
  1. [Abstract] Abstract: The claim that PAR3D enables part-aware understanding and grounding 'in 3D scenes' for embodied interaction is load-bearing on sim-to-real transfer, yet the abstract introduces ScenePart solely for training/evaluation and reports improvements without any reference to real-world 3D datasets (ScanNet, 3RScan, etc.) or domain-randomization metrics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for precision in the abstract regarding the scope of our claims. We agree that the current wording could overstate generalization and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that PAR3D enables part-aware understanding and grounding 'in 3D scenes' for embodied interaction is load-bearing on sim-to-real transfer, yet the abstract introduces ScenePart solely for training/evaluation and reports improvements without any reference to real-world 3D datasets (ScanNet, 3RScan, etc.) or domain-randomization metrics.

    Authors: We agree the abstract phrasing is imprecise. The manuscript evaluates exclusively on the synthetic ScenePart dataset; no experiments on ScanNet, 3RScan or other real-world 3D datasets are reported, and no domain-randomization or sim-to-real metrics are provided. The framework is motivated by embodied interaction needs, but we do not claim or demonstrate transfer. We will revise the abstract to state that gains are shown on synthetic part-level data and note the absence of real-world validation as a limitation, with sim-to-real transfer left for future work. revision: yes

Circularity Check

0 steps flagged

No circularity; standard empirical proposal with independent dataset and modules

full rationale

The paper introduces a new synthetic dataset (ScenePart) and two new modules (Part-Aware 3D Representation Learning and Hierarchical Segmentation Query Generation) as contributions, then reports empirical gains on part-level and object-level tasks. No equations, derivations, or claims reduce a 'prediction' or result to a fitted input by construction, nor does any load-bearing premise rest on a self-citation chain or imported uniqueness theorem. The central claims rest on external evaluation metrics rather than self-referential definitions, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no specific free parameters, axioms, or invented entities beyond the high-level contributions can be identified or verified.

invented entities (2)
  • PAR3D no independent evidence
    purpose: Unified part-aware 3D-MLLM framework
    New model name and architecture introduced in abstract.
  • ScenePart no independent evidence
    purpose: Synthetic 3D scene dataset with part-level annotations
    New dataset introduced in abstract.

pith-pipeline@v0.9.1-grok · 5745 in / 1148 out tokens · 28159 ms · 2026-06-28T02:15:29.159157+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 20 canonical work pages · 4 internal anchors

  1. [1]

    Achlioptas, A

    P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020

  2. [2]

    Azuma, T

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

  3. [3]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Banerjee and A

    S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  5. [5]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

    G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021

  6. [6]

    A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

  7. [7]

    D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020

  8. [8]

    S. Chen, H. Zhu, X. Chen, Y . Lei, G. Yu, and T. Chen. End-to-end 3d dense captioning with vote2cap- detr. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11124–11133, 2023. 10

  9. [9]

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024

  10. [10]

    Y . Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang. Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370, 2024

  11. [11]

    Z. Chen, A. Gholami, M. Nießner, and A. X. Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203, 2021

  12. [12]

    Z. Chen, R. Hu, X. Chen, M. Nießner, and A. X. Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding. InProceedings of the IEEE/CVF international conference on computer vision, pages 18109–18119, 2023

  13. [13]

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  14. [14]

    S. Dai, Y . Qu, Z. Li, X. Li, S. Zhang, and L. Cao. Training-free hierarchical scene understanding for gaussian splatting with superpoint graphs. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3673–3682, 2025

  15. [15]

    J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3772–3782, 2025

  16. [16]

    H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021

  17. [17]

    H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

  18. [18]

    R. Fu, J. Liu, X. Chen, Y . Nie, and W. Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

  19. [19]

    H. Geng, H. Xu, C. Zhao, C. Xu, L. Yi, S. Huang, and H. Wang. Gapartnet: Cross-category domain- generalizable object perception and manipulation via generalizable and actionable parts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7081–7091, 2023

  20. [20]

    S. He, H. Ding, X. Jiang, and B. Wen. Segpoint: Segment any point cloud via large language model. In European Conference on Computer Vision, pages 349–367. Springer, 2024

  21. [21]

    Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

  22. [22]

    S. Hu, D. M. Arroyo, S. Debats, F. Manhardt, L. Carlone, and F. Tombari. Mixed diffusion for 3d indoor scene synthesis. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1262–1272, 2026

  23. [23]

    Huang, Y

    H. Huang, Y . Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y . Zhao, J. Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

  24. [24]

    An Embodied Generalist Agent in 3D World

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y . Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

  25. [25]

    Huang, X

    J. Huang, X. Ma, X. Linghu, Y . Fan, J. He, W. Tan, Q. Li, S.-C. Zhu, Y . Chen, B. Jia, et al. Leo-vl: Efficient scene representation for scalable 3d vision-language learning.arXiv preprint arXiv:2506.09935, 2025

  26. [26]

    Huang, X

    K.-C. Huang, X. Li, L. Qi, S. Yan, and M.-H. Yang. Reason3d: Searching and reasoning 3d segmentation via large language model. In2025 International Conference on 3D Vision (3DV), pages 1177–1186. IEEE, 2025

  27. [27]

    J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik. LERF: language embedded radiance fields. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 19672–19682. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01807. URL https: //doi.org/10.1109/ICCV51070.2023.01807. 11

  28. [28]

    Y . Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018

  29. [29]

    Y . Li, U. Upadhyay, H. Slim, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, and M. Elhoseiny. 3d compat: Composition of materials on parts of 3d things. InEuropean conference on computer vision, pages 110–127. Springer, 2022

  30. [30]

    C.-Y . Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  31. [31]

    H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  32. [32]

    M. Liu, Y . Zhu, H. Cai, S. Han, Z. Ling, F. Porikli, and H. Su. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21736–21746, 2023

  33. [33]

    C. Ma, Y . Li, X. Yan, J. Xu, Y . Yang, C. Wang, Z. Zhao, Y . Guo, Z. Chen, and C. Guo. P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

  34. [34]

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

  35. [35]

    X. Ma, B. Smart, Y . Bhalgat, S. Chen, X. Li, J. Ding, J. Gu, D. Z. Chen, S. Peng, J.-W. Bian, et al. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models. arXiv preprint arXiv:2405.10255, 2024

  36. [36]

    Z. Ma, Y . Yue, and G. Gkioxari. Find any part in 3d. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7818–7827, 2025

  37. [37]

    K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019

  38. [38]

    K. Mo, L. J. Guibas, M. Mukadam, A. Gupta, and S. Tulsiani. Where2act: From pixels to actions for articulated 3d objects. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6813–6823, 2021

  39. [39]

    Papineni, S

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  40. [40]

    A. V . Phan, M. Le Nguyen, Y . L. H. Nguyen, and L. T. Bui. Dgcnn: A convolutional neural network over large-scale labeled graphs.Neural Networks, 108:533–543, 2018

  41. [41]

    C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

  42. [42]

    Z. Qi, Y . Fang, Z. Sun, X. Wu, T. Wu, J. Wang, D. Lin, and H. Zhao. Gpt4point: A unified framework for point-language understanding and generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26417–26427, 2024

  43. [43]

    M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. Langsplat: 3d language gaussian splatting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 20051–20060. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01895. URL https://doi.org/10.1109/CVPR52733.2024.01895

  44. [44]

    Y . Qu, Y . Wang, and Y . Qi. Sg-nerf: Semantic-guided point-based neural radiance fields. In2023 IEEE International Conference on Multimedia and Expo (ICME), pages 570–575. IEEE, 2023

  45. [45]

    Y . Qu, S. Dai, X. Li, J. Lin, L. Cao, S. Zhang, and R. Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. InProceedings of the 32nd ACM International Conference on Multimedia, pages 5328–5337, 2024

  46. [46]

    Y . Qu, D. Chen, X. Li, X. Li, S. Zhang, L. Cao, and R. Ji. Drag your gaussian: Effective drag-based editing with score distillation for 3d gaussian splatting.ArXiv preprint, abs/2501.18672, 2025. URL https://arxiv.org/abs/2501.18672. 12

  47. [47]

    H. Slim, X. Li, Y . Li, M. Ahmed, M. Ayman, U. Upadhyay, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, et al. 3dcompat++: An improved large-scale 3d vision dataset for compositional recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  48. [48]

    Y . Tang, X. Han, X. Li, Q. Yu, Y . Hao, L. Hu, and M. Chen. Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6617–6626, 2024

  49. [49]

    Umam, C.-K

    A. Umam, C.-K. Yang, M.-H. Chen, J.-H. Chuang, and Y .-Y . Lin. Partdistill: 3d shape part segmentation by vision-language model distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3470–3479, 2024

  50. [50]

    Vedantam, C

    R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

  51. [51]

    J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner. Rio: 3d object instance re-localization in changing indoor environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019

  52. [52]

    C. Wang, J. Ye, Y . Yang, Y . Li, Z. Lin, J. Zhu, Z. Chen, Y . Luo, and C. Guo. Part-x-mllm: Part-aware 3d multimodal large language model.arXiv preprint arXiv:2511.13647, 2025

  53. [53]

    J. Wang, D. Wang, J. Hu, Q. Zhang, J. Yu, and L. Xu. Kinematify: Open-vocabulary synthesis of high-dof articulated objects.arXiv preprint arXiv:2511.01294, 2025

  54. [54]

    Y . Wang, J. Wang, Y . Qu, and Y . Qi. Rip-nerf: learning rotation-invariant point-based neural radiance field for fine-grained editing and compositing. InProceedings of the 2023 ACM International Conference on Multimedia Retrieval, pages 125–134, 2023

  55. [55]

    Y . Wang, J. Wang, R. Gao, Y . Qu, W. Duan, S. Yang, and Y . Qi. Look at the sky: Sky-aware efficient 3d gaussian splatting in the wild.IEEE Transactions on Visualization and Computer Graphics, 2025

  56. [56]

    C. Wu, Y . Ma, Q. Chen, H. Wang, G. Luo, J. Ji, and X. Sun. 3d-stmn: Dependency-driven superpoint-text matching network for end-to-end 3d referring expression segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5940–5948, 2024

  57. [57]

    X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao. Point transformer v3: Simpler faster stronger. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024

  58. [58]

    X. Wu, D. DeTone, D. Frost, T. Shen, C. Xie, N. Yang, J. Engel, R. Newcombe, H. Zhao, and J. Straub. Sonata: Self-supervised learning of reliable point representations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22193–22204, 2025

  59. [59]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

  60. [60]

    R. Xu, X. Wang, T. Wang, Y . Chen, J. Pang, and D. Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer, 2024

  61. [61]

    Y . Yang, Y . Huang, Y .-C. Guo, L. Lu, X. Wu, E. Y . Lam, Y .-P. Cao, and X. Liu. Sampart3d: Segment any part in 3d objects.arXiv preprint arXiv:2411.07184, 2024

  62. [62]

    L. Yi, V . G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections.ACM Transactions on Graphics (ToG), 35(6):1–12, 2016

  63. [63]

    Zemskova and D

    T. Zemskova and D. Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8885–8895, 2025

  64. [64]

    J. Zha, Y . Fan, X. Yang, C. Gao, and X. Chen. How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

  65. [65]

    Zhang, Z

    Y . Zhang, Z. Gong, and A. X. Chang. Multi3drefer: Grounding text description to multiple 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15225–15236, 2023. 13

  66. [66]

    arXiv preprint arXiv:2510.23607 (2025)

    Y . Zhang, X. Wu, Y . Lao, C. Wang, Z. Tian, N. Wang, and H. Zhao. Concerto: Joint 2d-3d self-supervised learning emerges spatial representations.arXiv preprint arXiv:2510.23607, 2025

  67. [67]

    Zhang, X

    Y . Zhang, X. Wu, Y . Yang, X. Fan, H. Li, Y . Zhang, Z. Huang, N. Wang, and H. Zhao. Utonia: Toward one encoder for all point clouds.arXiv preprint arXiv:2603.03283, 2026

  68. [68]

    H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V . Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021

  69. [69]

    Zheng, S

    D. Zheng, S. Huang, and L. Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8995–9006, 2025

  70. [70]

    Y . Zhou, J. Gu, T. Y . Chiang, F. Xiang, and H. Su. Point-sam: Promptable 3d segmentation model for point clouds.arXiv preprint arXiv:2406.17741, 2024

  71. [71]

    C. Zhu, T. Wang, W. Zhang, K. Chen, and X. Liu. Scanreason: Empowering 3d visual grounding with reasoning capabilities. InEuropean Conference on Computer Vision, pages 151–168. Springer, 2024

  72. [72]

    C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025

  73. [73]

    Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023. 14 A Evaluation Metrics We evaluate PAR3D with existing approaches on referring segmentation, visual question answering, and dense capti...