pith. machine review for the scientific record.

arxiv: 2605.08925 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Few-Click-Driven Interactive 3D Segmentation with Semantic Embedding

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords: interactive 3D segmentation · point cloud · semantic embeddings · instance segmentation · hierarchical decoder · few-click annotation

The pith

A 3D interactive segmentation framework processes multiple user clicks together in one forward pass to label objects accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method for interactive 3D instance segmentation on sparse point clouds that accepts clicks for several objects at once. It builds on a point Transformer encoder and a hierarchical mask decoder that uses learnable semantic embeddings to handle all queries jointly. This setup models relationships between objects and refines both masks and semantics without running the model repeatedly after each new click. The result is higher accuracy with fewer interactions, reflected in large gains on standard metrics and in cross-dataset tests.

Core claim

The central claim is that a point Transformer encoder paired with a hierarchical mask decoder conditioned on learnable semantic embeddings can jointly reason over multiple click queries on downsampled 3D points in a single forward pass, producing refined spatial masks and semantic predictions while capturing inter-instance relationships.

What carries the argument

Hierarchical mask decoder with learnable semantic embeddings that performs multi-level crop-and-merge operations conditioned on all click queries at once.
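Read literally, that mechanism admits a compact sketch. Below is a minimal PyTorch rendering of a single-pass hierarchical decoder of this kind. It is an assumption-laden reading of the abstract, not the authors' implementation: the soft crop-and-merge, module names, and shapes are all invented for illustration.

    import torch
    import torch.nn as nn

    class HierarchicalMaskDecoder(nn.Module):
        """Sketch: one mask per click query, refined over coarse-to-fine levels."""
        def __init__(self, dim=128, num_classes=40, levels=3):
            super().__init__()
            # fold each query's "crop" (mask-weighted context) back into the query
            self.merge = nn.ModuleList([nn.Linear(dim, dim) for _ in range(levels)])
            # joint reasoning: self-attention across ALL click queries at once
            self.self_attn = nn.ModuleList(
                [nn.MultiheadAttention(dim, 4, batch_first=True) for _ in range(levels)]
            )
            self.sem_head = nn.Linear(dim, num_classes)

        def forward(self, queries, feats):
            # queries: (B, K, dim), one per user click
            # feats: list of (B, N_l, dim) point features, ordered coarse -> fine
            for attn, merge, f in zip(self.self_attn, self.merge, feats):
                masks = torch.einsum("bkd,bnd->bkn", queries, f)  # per-query mask logits
                w = masks.sigmoid()                               # soft "crop" weights
                crop = torch.einsum("bkn,bnd->bkd", w, f) / (w.sum(-1, keepdim=True) + 1e-6)
                queries = queries + merge(crop)                   # "merge" step
                out, _ = attn(queries, queries, queries)          # inter-instance reasoning
                queries = queries + out
            return masks, self.sem_head(queries)  # finest-level masks + semantic logits

One pass through this loop yields masks for every clicked object simultaneously; a corrective click only changes the query set, not the model weights, which is the advertised contrast with per-click retraining.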

If this is right

  • Multiple objects are segmented, often with only a single click per object.
  • The approach yields over 20 percent higher mIoU than strong baselines on standard benchmarks (see the mIoU sketch after this list).
  • Cross-dataset tests show 8-10 percent gains in the one-click-per-instance setting.
  • The method supports real-time uses such as robotic manipulation and rapid 3D annotation without per-click model retraining.
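For the metric behind those numbers, here is a minimal sketch of mIoU in the click setting, assuming one predicted and one ground-truth mask per clicked instance; the paper's exact averaging protocol is not stated in the material here.

    import numpy as np

    def miou(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
        """pred_masks, gt_masks: (K, N) boolean arrays, one row per clicked instance."""
        ious = []
        for p, g in zip(pred_masks, gt_masks):
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            ious.append(inter / union if union else 1.0)  # empty-vs-empty counts as perfect
        return float(np.mean(ious))

Whether "over 20 percent" means percentage points or relative improvement is not resolved by the abstract; the sketch returns a fraction in [0, 1] either way.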

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Annotation effort for large 3D scenes could drop substantially if one click per object becomes routine.
  • Avoiding reliance on 2D foundation models may improve robustness on raw point clouds from new sensors.
  • The single-pass design could extend to dynamic scenes if temporal embeddings are added in follow-up work.
  • Real-time deployment on mobile robots would benefit from the reduced compute of one forward pass.

Load-bearing premise

The hierarchical mask decoder with learnable semantic embeddings can jointly reason over all click queries, model inter-instance relationships, and refine masks and semantics without needing repeated model updates after each corrective click.

What would settle it

The joint-reasoning claim would be falsified by tests on scenes with many closely spaced or overlapping objects in which accuracy falls below sequential single-object baselines, or in which mIoU fails to improve by the reported margins.
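Reusing the miou sketch above, that test can be phrased as a small harness. Everything here is hypothetical scaffolding: joint_model, sequential_baseline, and crowded_scenes are stand-ins, and margin=0.20 assumes the reported "20 percent" is meant in mIoU points.

    def joint_claim_survives(joint_model, sequential_baseline, crowded_scenes, margin=0.20):
        """crowded_scenes: iterable of (scene, clicks, gt_masks) with closely spaced objects."""
        for scene, clicks, gt_masks in crowded_scenes:
            m_joint = miou(joint_model(scene, clicks), gt_masks)
            m_seq = miou(sequential_baseline(scene, clicks), gt_masks)
            # one check covers both failure modes: falling below the sequential
            # baseline and missing the reported improvement margin
            if m_joint - m_seq < margin:
                return False
        return True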

Figures

Figures reproduced from arXiv: 2605.08925 by Kourosh Khoshelham, Liangliang Nan, Xueyang Kang, Zijian Yu.

Figure 1
Figure 1: Overview of our click-based instance segmentation framework. Given a scene S with user-provided clicks C, the scene encoder extracts multi-scale scene features {F0, ..., FL}, while the query encoder produces query features Q. The transformer block refines these features into Qt, which the Conditioned Query Adaptor further refines into Qs using the semantic prototype Ps and semantic embedding Es. The mask d…
Figure 2
Figure 2: Baseline comparison at 1 click per instance with identical click positions: above the dashed line on ScanNet40 [5], and below on KITTI360 [32]. Each instance class is shown using a consistent color, with the red box showing a zoomed-in region for closer inspection of the segmentation mask details.
Figure 3
Figure 3: Ablation study visualization of instance segmentation on a selected indoor scene with different modules removed; the leftmost shows the Ground Truth, with stars indicating click point positions as input.
Figure 4
Figure 4: Plot of mIoU results for all methods as a function of the number of clicks.
Figure 5
Figure 5: Plot of mIoU test performance on ScanNet40 as a function of the number of click query points during inference (query numbers ranging from 50 to 200 during training are explored).
Figure 6
Figure 6: Plot of (a) mIoU as a function of embedding dimension, with spatial embeddings in blue and semantic embeddings in orange; (b) mIoU as a function of the number of semantic class prototype embeddings.
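The Figure 1 caption names the dataflow precisely enough to sketch: scene features {F0, ..., FL}, click queries Q refined to Qt by a transformer block, then to Qs by a Conditioned Query Adaptor using semantic prototypes Ps and embeddings Es. A hedged rendering of that adaptor follows, with every internal detail assumed rather than taken from the paper.

    import torch
    import torch.nn as nn

    class ConditionedQueryAdaptor(nn.Module):
        """Sketch of the Qt -> Qs step from Figure 1 (internals are guesses)."""
        def __init__(self, dim=128, num_prototypes=20):
            super().__init__()
            self.Ps = nn.Parameter(torch.randn(num_prototypes, dim))  # semantic prototypes
            self.Es = nn.Parameter(torch.randn(num_prototypes, dim))  # semantic embeddings
            self.proj = nn.Linear(dim, dim)

        def forward(self, Qt):                             # Qt: (B, K, dim)
            sim = torch.softmax(Qt @ self.Ps.t(), dim=-1)  # match queries to prototypes
            return Qt + self.proj(sim @ self.Es)           # inject matched semantics -> Qs

    # Overall flow per the caption (assumed): F = scene_encoder(S); Q = query_encoder(C);
    # Qt = transformer_block(Q, F); Qs = ConditionedQueryAdaptor()(Qt); masks = decoder(Qs, F)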
original abstract

Interactive segmentation allows efficient label generation by leveraging user-provided clicks to progressively refine predictions, which is critical when fully supervised labels are costly or generalization to unseen classes is needed. Existing 3D interactive methods are limited: most operate sequentially, predicting only one object per iteration with binary masks, while several recent approaches depend on 2D foundation models and camera alignment to bridge the 2D-3D gap. To address these limitations, we propose a novel interactive segmentation framework that operates directly on sparse, randomly downsampled 3D points and processes multiple object clicks in a single forward pass. Our framework consists of a point Transformer-based encoder and a hierarchical mask decoder, which integrates multi-level crop-and-merge operations conditioned on learnable semantic embeddings. Unlike prior interactive approaches that require repeated model updates after each manually corrective click, our method jointly reasons over all click queries, modeling inter-instance relationships and refining both spatial masks and semantic predictions through spatial and semantic embeddings. Extensive experiments demonstrate that our model improves the mIoU metric by over 20 percent compared to strong baselines and achieves 8-10 percent gains under cross-dataset evaluation for a one-click per instance setting, often requiring only a single click per object. Our approach provides a generalizable and efficient solution for interactive 3D instance segmentation, particularly suitable for real-time applications such as robotic manipulation, navigation, and rapid 3D semantic annotation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a novel interactive 3D instance segmentation framework that operates directly on sparse 3D points. It uses a point Transformer encoder and a hierarchical mask decoder with learnable semantic embeddings and multi-level crop-and-merge operations to process multiple object clicks in a single forward pass, jointly modeling inter-instance relationships while refining spatial masks and semantic predictions. The work claims over 20% mIoU gains versus strong baselines and 8-10% improvements in cross-dataset one-click-per-instance settings, often needing only a single click per object.

Significance. If the performance claims and single-pass multi-object reasoning hold under rigorous evaluation, the method could meaningfully advance efficient interactive 3D segmentation for real-time applications such as robotic manipulation and rapid annotation, by eliminating the need for repeated model updates after each corrective click.

major comments (2)
  1. [Abstract] The central performance claims ('improves the mIoU metric by over 20 percent' and 'achieves 8-10 percent gains under cross-dataset evaluation') are stated without any experimental details: dataset identities and sizes, baseline specifications, number of trials, error bars, ablation results, or statistical significance tests. These omissions make the quantitative assertions impossible to evaluate, even though those assertions are load-bearing for the paper's primary contribution.
  2. [Abstract] The hierarchical mask decoder is asserted to 'jointly reason over all click queries, modeling inter-instance relationships' via learnable semantic embeddings and multi-level crop-and-merge, yet no formulation is supplied for click encoding, cross-query interaction (attention or otherwise), embedding conditioning, or scaling behavior with click count or instance density. This architectural mechanism is load-bearing for the advertised single-forward-pass advantage over sequential baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We have carefully considered each point and provide detailed responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

point-by-point responses
  1. Referee: [Abstract] The central performance claims ('improves the mIoU metric by over 20 percent' and 'achieves 8-10 percent gains under cross-dataset evaluation') are stated without any experimental details: dataset identities and sizes, baseline specifications, number of trials, error bars, ablation results, or statistical significance tests. These omissions make the quantitative assertions impossible to evaluate, even though those assertions are load-bearing for the paper's primary contribution.

    Authors: We agree that the abstract would benefit from more specificity to allow readers to better contextualize the claims. In the revised manuscript, we will update the abstract to specify the primary datasets (ScanNet and S3DIS), note the baselines used (including recent interactive 3D segmentation methods), and indicate that results are reported as averages over multiple random seeds with standard deviations, with full details, ablations, and statistical analysis provided in the Experiments section. This revision maintains the abstract's conciseness while addressing the evaluation concerns. revision: yes

  2. Referee: [Abstract] The hierarchical mask decoder is asserted to 'jointly reason over all click queries, modeling inter-instance relationships' via learnable semantic embeddings and multi-level crop-and-merge, yet no formulation is supplied for click encoding, cross-query interaction (attention or otherwise), embedding conditioning, or scaling behavior with click count or instance density. This architectural mechanism is load-bearing for the advertised single-forward-pass advantage over sequential baselines.

    Authors: The abstract provides a high-level overview of the proposed framework. The detailed formulations for click encoding (using positional and semantic embeddings), cross-query interactions through the point Transformer's self-attention layers, embedding conditioning in the hierarchical decoder, and analysis of scaling with click count and instance density are presented in Sections 3.1-3.3, including the relevant equations and architectural diagrams. To strengthen the abstract, we will incorporate a brief mention of the joint reasoning mechanism via attention-based query interactions. We believe this clarifies the single-pass advantage without requiring major expansion. revision: partial
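The rebuttal's description of click encoding (positional plus semantic embeddings, then attention-based query interaction) is concrete enough for a small sketch. Two assumptions are ours, not the paper's: learnable Fourier features for the 3D position (in the spirit of [31]) and a two-way positive/negative click type.

    import math
    import torch
    import torch.nn as nn

    class ClickEncoder(nn.Module):
        """Sketch: turn K user clicks into K query vectors for joint attention."""
        def __init__(self, dim=128, num_bands=16):
            super().__init__()
            self.freqs = nn.Parameter(torch.randn(num_bands, 3))  # learnable frequencies
            self.pos_proj = nn.Linear(2 * num_bands, dim)
            self.sem = nn.Embedding(2, dim)                       # click type, e.g. pos/neg

        def forward(self, xyz, click_type):
            # xyz: (K, 3) click coordinates; click_type: (K,) integer tensor
            ang = 2 * math.pi * xyz @ self.freqs.t()              # (K, num_bands)
            pos = self.pos_proj(torch.cat([ang.sin(), ang.cos()], dim=-1))
            return pos + self.sem(click_type)                     # (K, dim) queries

Self-attention over these K queries is the cross-query interaction the referee asks about; its cost grows as O(K^2), which is where the open scaling question with click count and instance density would bite.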

Circularity Check

0 steps flagged

No circularity: empirical architecture with experimental validation only

full rationale

The paper describes a point Transformer encoder and hierarchical mask decoder architecture for multi-click 3D instance segmentation, supported solely by experimental mIoU gains on datasets. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text or abstract. Claims of single-pass joint reasoning and inter-instance modeling are presented as design choices validated empirically, not as results derived from prior self-referential inputs. The work is self-contained against external benchmarks with no reduction of predictions to author-defined fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unverified effectiveness of the proposed encoder-decoder architecture and the learnable semantic embeddings; no free parameters, standard axioms, or independently evidenced invented entities are identifiable from the abstract alone.

invented entities (1)
  • learnable semantic embeddings: no independent evidence
    purpose: Condition the hierarchical mask decoder to separate instances and refine masks and semantics jointly
    Introduced as a core component of the framework, but no independent evidence or falsifiable prediction is provided in the abstract.

pith-pipeline@v0.9.0 · 5557 in / 1165 out tokens · 43752 ms · 2026-05-12T01:46:05.620616+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

  1. Armeni, I., Sax, A., Zamir, A.R., Savarese, S.: Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints (Feb 2017)
  2. Boudjoghra, M.E.A., Al Khatib, S., Lahoud, J., Cholakkal, H., Anwer, R., Khan, S.H., Shahbaz Khan, F.: 3D indoor instance segmentation in an open-world. Advances in Neural Information Processing Systems 36 (2024)
  3. Choi, D., Cho, W., Kim, K., Choo, J.: iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds (2023)
  4. Choi, S., Song, H., Kim, J., Kim, T., Do, H.: Click-Gaussian: Interactive segmentation to any 3D Gaussians. In: European Conference on Computer Vision. pp. 289–305. Springer (2025)
  5. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE (2017)
  6. Elmqvist, N., Dragicevic, P., Fekete, J.D.: Rolling the dice: Multidimensional visual exploration using scatterplot matrix navigation. IEEE Transactions on Visualization and Computer Graphics 14(6), 1539–1148 (2008)
  7. Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., Tombari, F.: OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views (2024). https://doi.org/10.48550/arXiv.2404.03650
  8. Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., Tombari, F.: OpenNeRF: Open set 3D neural scene segmentation with pixel-wise features and rendered novel views. arXiv preprint arXiv:2404.03650 (2024)
  9. Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels (2022). https://doi.org/10.48550/arXiv.2112.12143
  10. Goel, R., Sirikonda, D., Saini, S., Narayanan, P.: Interactive segmentation of radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4201–4211 (2023)
  11. Guo, H., Zhu, H., Peng, S., Wang, Y., Shen, Y., Hu, R., Zhou, X.: SAM-guided graph cut for 3D instance segmentation. arXiv preprint arXiv:2312.08372 (2023)
  12. Guo, Z., Zhang, R., Zhu, X., Tong, C., Gao, P., Li, C., Heng, P.A.: SAM2Point: Segment any 3D as videos in zero-shot and promptable manners. arXiv preprint arXiv:2408.16768 (2024)
  13. Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans (2019). https://doi.org/10.48550/arXiv.1812.07003
  14. Huang, R., Peng, S., Takmaz, A., Tombari, F., Pollefeys, M., Song, S., Huang, G., Engelmann, F.: Segment3D: Learning fine-grained class-agnostic 3D segmentation without manual labels. European Conference on Computer Vision (ECCV) (2024)
  15. Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction (2024). https://doi.org/10.48550/arXiv.2405.17429
  16. Jain, U., Mirzaei, A., Gilitschenski, I.: GaussianCut: Interactive segmentation via graph cut for 3D Gaussian splatting. arXiv preprint arXiv:2411.07555 (2024)
  17. Jatavallabhula, K.M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Li, S., Iyer, G., Saryazdi, S., Keetha, N., Tewari, A., Tenenbaum, J.B., de Melo, C.M., Krishna, M., Paull, L., Shkurti, F., Torralba, A.: ConceptFusion: Open-set Multimodal 3D Mapping (2023). https://doi.org/10.48550/arXiv.2302.07241
  18. Kang, X., Li, Z., Lan, T., Gong, D., Khoshelham, K., Nan, L.: Hierarchical point-patch fusion with adaptive patch codebook for 3D shape anomaly detection. arXiv preprint arXiv:2604.03972 (2026)
  19. Kang, X., Xiang, Z., Zhang, Z., Khoshelham, K.: Look beyond: Two-stage scene view generation via panorama and video diffusion. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 9375–9384 (2025)
  20. Kang, X., Xiang, Z., Zhang, Z., Khoshelham, K.: Multi-view geometry-aware diffusion transformer for novel view synthesis of indoor scenes. In: 2025 International Joint Conference on Neural Networks (IJCNN). pp. 1–10. IEEE (2025)
  21. Kang, X., Yin, S., Fen, Y.: 3D reconstruction & assessment framework based on affordable 2D lidar. In: 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM). pp. 292–297. IEEE (2018)
  22. Kang, X., Yuan, S.: Robust data association for object-level semantic SLAM. arXiv preprint arXiv:1909.13493 (2019)
  23. Kang, X., Zhao, H., Khoshelham, K., Patrick, V.: 2D surfel-based 3D point cloud registration with robust equivariant SE(3) features. In: 2025 IEEE International Geoscience and Remote Sensing Symposium (2025)
  24. Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: LERF: Language embedded radiance fields. In: International Conference on Computer Vision (ICCV) (2023)
  25. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment Anything (2023). https://doi.org/10.48550/arXiv.2304.02643
  26. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)
  27. Kobayashi, S., Matsumoto, E., Sitzmann, V.: Decomposing NeRF for editing via feature field distillation. In: Advances in Neural Information Processing Systems. vol. 35 (2022). https://arxiv.org/pdf/2205.15585.pdf
  28. Kontogianni, T., Celikkan, E., Tang, S., Schindler, K.: Interactive Object Segmentation in 3D Point Clouds. ICRA (2023)
  29. Lan, K., Li, H., Shi, H., Wu, W., Liao, Y., Wang, L., Zhou, P.: 2D-Guided 3D Gaussian Segmentation (2023). https://doi.org/10.48550/arXiv.2312.16047
  30. Lang, I., Xu, F., Decatur, D., Babu, S., Hanocka, R.: iSeg: Interactive 3D segmentation via interactive attention. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
  31. Li, Y., Si, S., Li, G., Hsieh, C.J., Bengio, S.: Learnable Fourier features for multi-dimensional spatial positional encoding. Advances in Neural Information Processing Systems 34, 15816–15829 (2021)
  32. Liao, Y., Xie, J., Geiger, A.: KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. Pattern Analysis and Machine Intelligence (PAMI) (2022)
  33. Liu, L., Kong, T., Zhu, M., Fan, J., Fang, L.: ClickSeg: 3D instance segmentation with click-level weak annotations. arXiv preprint arXiv:2307.09732 (2023)
  34. Liu, Y., Hu, B., Tang, C.K., Tai, Y.W.: SANeRF-HQ: Segment anything for NeRF in high quality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3216–3226 (2024)
  35. Lu, S., Chang, H., Jing, E.P., Boularias, A., Bekris, K.: OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data (2023). https://doi.org/10.48550/arXiv.2311.02873
  36. Nguyen, P.D.A., Ngo, T.D., Kalogerakis, E., Gan, C., Tran, A., Pham, C., Nguyen, K.: Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance (2024). https://doi.org/10.48550/arXiv.2312.10671
  37. Ošep, A., Meinhardt, T., Ferroni, F., Peri, N., Ramanan, D., Leal-Taixé, L.: Better Call SAL: Towards Learning to Segment Anything in Lidar (2024). https://doi.org/10.48550/arXiv.2403.13129
  38. Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: OpenScene: 3D scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 815–824 (2023)
  39. Peng, X., Chen, R., Qiao, F., Kong, L., Liu, Y., Wang, T., Zhu, X., Ma, Y.: SAM-guided unsupervised domain adaptation for 3D segmentation. arXiv preprint arXiv:2310.08820 (2023)
  40. Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: LangSplat: 3D Language Gaussian Splatting (2024). https://doi.org/10.48550/arXiv.2312.16084
  41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  42. Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3D: Mask Transformer for 3D Semantic Instance Segmentation (2023). https://doi.org/10.48550/arXiv.2210.03105
  43. Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3D Gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482 (2023)
  44. Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3D Gaussians for open-vocabulary scene understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5333–5343 (2024)
  45. Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: OpenMask3D: Open-Vocabulary 3D Instance Segmentation (2023). https://doi.org/10.48550/arXiv.2306.13631
  46. Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: SAM 3D: 3Dfy anything in images (2025). https://arxiv.org/abs/2511.16624
  47. Wong, L.H.K., Kang, X., Bai, K., Zhang, J.: A survey of robotic navigation and manipulation with physics simulators in the era of embodied AI. arXiv preprint arXiv:2505.01458 (2025)
  48. Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H.: Point Transformer V3: Simpler, faster, stronger. In: CVPR (2024)
  49. Wu, Y., Meng, J., Li, H., Wu, C., Shi, Y., Cheng, X., Zhao, C., Feng, H., Ding, E., Wang, J., et al.: OpenGaussian: Towards point-level 3D Gaussian-based open vocabulary understanding. arXiv preprint arXiv:2406.02058 (2024)
  50. Xu, M., Yin, X., Qiu, L., Liu, Y., Tong, X., Han, X.: SAMPro3D: Locating SAM prompts in 3D for zero-shot scene segmentation. arXiv preprint arXiv:2311.17707 (2023)
  51. Xu, M., Yin, X., Qiu, L., Liu, Y., Tong, X., Han, X.: SAMPro3D: Locating SAM prompts in 3D for zero-shot instance segmentation. In: International Conference on 3D Vision (3DV) (2025)
  52. Yan, M., Zhang, J., Zhu, Y., Wang, H.: MaskClustering: View consensus based mask graph clustering for open-vocabulary 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28274–28284 (2024)
  53. Yang, Y., Wu, X., He, T., Zhao, H., Liu, X.: SAM3D: Segment anything in 3D scenes (2023)
  54. Yin, Y., Liu, Y., Xiao, Y., Cohen-Or, D., Huang, J., Chen, B.: SAI3D: Segment Any Instance in 3D Scenes (2024). https://doi.org/10.48550/arXiv.2312.11557
  55. Yue, Y., Mahadevan, S., Schult, J., Engelmann, F., Leibe, B., Schindler, K., Kontogianni, T.: AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation. In: International Conference on Learning Representations (ICLR) (2024)
  56. Zhang, P., Wu, T., Sun, J., Li, W., Su, Z.: Refining Segmentation On-the-Fly: An Interactive Framework for Point Cloud Semantic Segmentation (2024). https://doi.org/10.48550/arXiv.2403.06401
  57. Zheng, Y., Chen, X., Zheng, Y., Gu, S., Yang, R., Jin, B., Li, P., Zhong, C., Wang, Z., Liu, L., et al.: GaussianGrasper: 3D language Gaussian splatting for open-vocabulary robotic grasping. arXiv preprint arXiv:2403.09637 (2024)
  58. Zhou, Y., Gu, J., Chiang, T.Y., Xiang, F., Su, H.: Point-SAM: Promptable 3D segmentation model for point clouds. arXiv preprint arXiv:2406.17741 (2024)
  59. Zhu, X., Zhou, H., Xing, P., Zhao, L., Xu, H., Liang, J., Hauptmann, A., Liu, T., Gallagher, A.: Open-vocabulary 3D semantic segmentation with text-to-image diffusion models. In: European Conference on Computer Vision. pp. 357–375. Springer (2024)