pith. machine review for the scientific record.

arxiv: 2604.03309 · v1 · submitted 2026-03-31 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 00:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D Gaussian Splatting · hierarchical segmentation · contrastive learning · scene understanding · object-part hierarchies · view consistency · 3D segmentation

The pith

TreeGaussian builds a multi-level object tree to guide cascaded contrastive learning for hierarchically consistent segmentation of 3D Gaussian scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TreeGaussian to overcome limitations in 3D Gaussian Splatting methods that fail to capture whole-part relationships and hierarchical semantics in complex scenes. Dense pairwise comparisons and inconsistent labels from 2D priors create redundancy and instability in feature learning. The framework constructs an object tree to structure supervision across levels and applies a two-stage cascaded contrastive strategy that refines representations progressively from global to local. A Consistent Segmentation Detection mechanism and graph-based denoising align outputs across views and suppress unstable points. This matters for applications needing reliable part-level scene understanding in real-time 3D representations.

Core claim

TreeGaussian constructs a multi-level object tree from 2D priors to explicitly model hierarchical semantic relationships, then applies a two-stage cascaded contrastive learning strategy that progressively refines features from global to local. A Consistent Segmentation Detection mechanism and graph-based denoising align segmentation modes across views and suppress unstable Gaussians, yielding improved hierarchical consistency and segmentation quality.

What carries the argument

The multi-level object tree that structures contrastive supervision across object-part hierarchies, together with the two-stage cascaded contrastive learning strategy that reduces redundancy and mitigates feature saturation.

If this is right

  • Structured learning across object-part hierarchies becomes feasible in real-time 3D Gaussian representations.
  • Redundancy in contrastive supervision is reduced through progressive global-to-local refinement.
  • Segmentation modes align across different views via the CSD mechanism.
  • Unstable Gaussian points are suppressed by the graph-based denoising module.
  • Performance improves on open-vocabulary 3D object selection and 3D point cloud understanding tasks.
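The cascade's redundancy-reduction claim can be made concrete with a sketch: a supervised InfoNCE stand-in where global supervision uses object labels over the whole scene, while part-level negatives are drawn only within each object. The loss form, function names, and the `alpha` weighting are editorial assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def info_nce(features, labels, temperature=0.1):
    """Supervised InfoNCE over L2-normalized per-point features.

    Points sharing a label are positives; everything else is a negative.
    A generic stand-in for the paper's contrastive terms, not its exact loss.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    np.fill_diagonal(sim, -np.inf)                      # exclude self-pairs
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos, False)
    return -log_p[pos].mean()

def cascaded_loss(features, object_ids, part_ids, alpha=0.5):
    """Stage 1 (global): pull Gaussians of the same object together.
    Stage 2 (local): separate parts, but only inside their object, so
    part-level negatives never cross object boundaries -- the redundancy
    reduction the cascade is meant to buy."""
    loss = info_nce(features, object_ids)
    for obj in np.unique(object_ids):
        idx = object_ids == obj
        if len(np.unique(part_ids[idx])) > 1:           # needs >1 part to contrast
            loss += alpha * info_nce(features[idx], part_ids[idx])
    return loss
```

Under this reading, stage 2 compares each Gaussian against at most the other parts of its own object rather than against every part in the scene, which is one way dense pairwise redundancy could be cut.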

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tree construction step could be extended to incorporate temporal consistency if applied to dynamic scenes.
  • Part-level features learned this way may support downstream robotic tasks that require grasping or manipulation at the object-component level.
  • The cascaded contrastive pattern might transfer to other hierarchical 3D representations such as neural radiance fields with added spatial partitioning.

Load-bearing premise

Inconsistent hierarchical labels from 2D priors can be turned into a stable multi-level object tree that guides learning without propagating errors into the final 3D segmentations.
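One minimal operationalization of that premise: attach each 2D part mask to the whole mask that contains most of it, and drop parts no whole reliably contains. This containment heuristic, and the `contain_thresh` parameter, are editorial assumptions standing in for the paper's multi-view SAM-based tree construction.

```python
import numpy as np

def build_object_tree(whole_masks, part_masks, contain_thresh=0.8):
    """Attach each part mask to the whole mask containing most of its area.

    Returns {whole_index: [part_indices]}. Parts not sufficiently contained
    by any whole are discarded -- a crude form of rejecting inconsistent
    2D labels before they reach 3D supervision.
    """
    tree = {i: [] for i in range(len(whole_masks))}
    for p, pm in enumerate(part_masks):
        area = pm.sum()
        if area == 0:
            continue
        # fraction of the part's area covered by each candidate whole
        overlaps = [np.logical_and(pm, wm).sum() / area for wm in whole_masks]
        best = int(np.argmax(overlaps))
        if overlaps[best] >= contain_thresh:
            tree[best].append(p)
    return tree
```

The interesting failure mode, which the referee report also flags, is what happens when a part straddles two wholes: a hard threshold drops it, while a softer rule would have to arbitrate.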

What would settle it

On scenes where 2D priors yield highly conflicting part labels, compare the cross-view consistency of the resulting 3D Gaussian segmentations with and without the tree-guided cascaded training; large drops in consistency would falsify the central claim.
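One plausible way to score that cross-view consistency, sketched here as an editorial proposal rather than a metric from the paper: compare co-assignment matrices between views, so that label IDs need not match across views, only the groupings.

```python
import numpy as np
from itertools import combinations

def coassignment(labels):
    """N x N boolean matrix: do points i and j share a label in this view?"""
    return labels[:, None] == labels[None, :]

def cross_view_consistency(view_labels):
    """Mean pairwise agreement of co-assignment matrices across views.

    1.0 means every pair of views groups the same points together,
    regardless of how each view names its segments.
    """
    scores = [np.mean(coassignment(a) == coassignment(b))
              for a, b in combinations(view_labels, 2)]
    return float(np.mean(scores))
```

Running this with and without the tree-guided cascade on high-conflict scenes would give the falsification test a concrete number.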

Figures

Figures reproduced from arXiv: 2604.03309 by Congcong Zheng, Feng Dai, Hao Jiang, Honglong Zhao, Jingbin You, Shuqin Gao, Tianlu Mao, Xinzhu Ma, Yucheng Zhang, Zehao Li, Zhaoqi Wang.

Figure 1
Figure 1: Motivation illustration. (a) Flat contrastive learning isolates feature spaces for object wholes and parts, limiting their hierarchical interaction. (b) Fused contrastive learning merges feature spaces but suffers from oversaturation and instability due to dense pairwise comparisons. (c) Cascaded contrastive learning (our method) preserves semantic hierarchy while minimizing contrastive redundancy… view at source ↗
Figure 2
Figure 2: Overview of our method. (a) Constructing an object tree from multi-view images using SAM to capture structured relationships between object parts and wholes. (b) Two-stage cascaded contrastive learning strategy to progressively optimize the instance feature of each Gaussian point. (c) Graph-based denoising applied to each language-mapped instance cluster to improve multi-view rendering quality… view at source ↗
Figure 3
Figure 3: Consistent Segmentation Detection (CSD) for local contrastive learning. The blue curve shows the raw split number and the red curve the smoothed reference across views. Views where the blue curve lies above the red reference are treated as over-segmentation (apply only the L2 pull loss), while views where it lies below are treated as under-segmentation (apply only the L2 push loss)… view at source ↗
Figure 4
Figure 4: Qualitative comparison of the rendered instance feature maps. Our method achieves better global feature consistency across objects (cup and spoon) at the whole scale and exhibits clearer feature separation at the part scale. Evaluated on the Lerf_ovs dataset [33]… view at source ↗
Figure 5
Figure 5: Qualitative comparison of rendered local objects. Our method produces cleaner and more accurate segmentation results compared to baselines, effectively reducing noise and preserving fine-grained structures. Panels: OpenGaussian vs. TreeGaussian, each at whole and part scales… view at source ↗
Figure 6
Figure 6: Qualitative comparison of click-based 3D object selection. Compared to OpenGaussian, our method produces more accurate and hierarchically consistent results at both whole and part scales… view at source ↗
Figure 7
Figure 7: Qualitative ablation results with the Consistent Segmentation Detection (CSD) mechanism and graph-based Gaussian point denoising. CSD reduces over-segmentation and enhances consistency, while the denoising module effectively suppresses distant clutter and improves clarity… view at source ↗
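The CSD rule described in Figure 3 can be sketched as a per-view mode classifier: smooth the raw split number into a reference curve, then label each view by which side of the reference it falls on. The moving-average smoothing and the `window` and `tol` hyperparameters are assumed details, not values from the paper.

```python
import numpy as np

def csd_mode(split_counts, window=5, tol=0.5):
    """Classify each view as over-, under-, or consistently segmented.

    Following Figure 3's rule: a moving-average reference smooths the raw
    per-view split number; views above it would receive only the pull
    term, views below it only the push term.
    """
    kernel = np.ones(window) / window
    ref = np.convolve(split_counts, kernel, mode="same")   # smoothed reference
    modes = np.where(split_counts > ref + tol, "over",
             np.where(split_counts < ref - tol, "under", "consistent"))
    return modes, ref
```

A training loop would then gate its local contrastive terms on `modes`, applying the pull loss to "over" views and the push loss to "under" views.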
Original abstract

3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TreeGaussian, a tree-guided cascaded contrastive learning framework for 3D Gaussian Splatting that constructs a multi-level object tree from 2D hierarchical labels, applies a two-stage global-to-local contrastive strategy, and incorporates a Consistent Segmentation Detection (CSD) mechanism plus graph-based denoising to improve hierarchical consistency and reduce redundancy in segmentation supervision.

Significance. If the central claims hold, the approach could advance structured 3D scene understanding by explicitly modeling object-part hierarchies in real-time Gaussian representations, with potential benefits for open-vocabulary selection and point-cloud tasks. The cascaded contrastive design and CSD module represent targeted innovations over standard contrastive baselines in 3DGS, but the absence of quantitative metrics, ablation tables, or error-propagation analysis in the provided description makes it difficult to gauge the magnitude of improvement or robustness.

major comments (2)
  1. [Method (tree construction and cascaded contrastive strategy)] The central claim depends on reliable construction of a multi-level object tree from inconsistent 2D priors, yet the manuscript provides no quantitative sensitivity analysis, ablation on label noise levels, or bounds on error propagation across views. If tree edges misalign, the global-to-local contrastive losses risk reinforcing rather than correcting inconsistencies, directly undermining the hierarchical consistency benefit.
  2. [Experiments and results] No numerical results, ablation tables, or error analysis are referenced to support the claims of enhanced segmentation consistency and quality. The abstract and description assert effectiveness from experiments on open-vocabulary selection and point-cloud understanding, but without reported metrics (e.g., mIoU, consistency scores) or baseline comparisons, the support for the central claims cannot be verified.
minor comments (2)
  1. [Method] Clarify the precise definition and implementation details of the Consistent Segmentation Detection (CSD) mechanism and graph-based denoising module, including how they interact with the contrastive losses.
  2. [Figures and tables] Ensure all figures and tables include clear captions, axis labels, and statistical significance indicators to aid interpretation of any ablation or comparison results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive feedback on TreeGaussian. We address each major comment below and will revise the manuscript to incorporate additional analysis and clearer experimental reporting as suggested.

Point-by-point responses
  1. Referee: [Method (tree construction and cascaded contrastive strategy)] The central claim depends on reliable construction of a multi-level object tree from inconsistent 2D priors, yet the manuscript provides no quantitative sensitivity analysis, ablation on label noise levels, or bounds on error propagation across views. If tree edges misalign, the global-to-local contrastive losses risk reinforcing rather than correcting inconsistencies, directly undermining the hierarchical consistency benefit.

    Authors: We appreciate this concern regarding robustness to inconsistent 2D priors. The multi-level object tree is built by aggregating hierarchical labels from multiple views via a graph structure that identifies consistent nodes, with the CSD mechanism explicitly detecting and enforcing segmentation consistency across views while the graph-based denoising removes unstable Gaussians. The cascaded global-to-local contrastive losses are intended to progressively correct rather than propagate errors. We acknowledge the absence of explicit sensitivity analysis in the current version and will add a new ablation subsection quantifying performance under varying label noise levels (simulated by random label flips) and providing empirical bounds on error propagation (measured via consistency scores before/after each stage). This revision will directly demonstrate that the framework mitigates misalignment. revision: yes

  2. Referee: [Experiments and results] No numerical results, ablation tables, or error analysis are referenced to support the claims of enhanced segmentation consistency and quality. The abstract and description assert effectiveness from experiments on open-vocabulary selection and point-cloud understanding, but without reported metrics (e.g., mIoU, consistency scores) or baseline comparisons, the support for the central claims cannot be verified.

    Authors: We apologize that the quantitative results were not sufficiently highlighted or cross-referenced in the version reviewed. The full manuscript contains ablation tables comparing the cascaded strategy and CSD module, along with numerical metrics including mIoU on 3D segmentation, view-consistency scores, and baseline comparisons for open-vocabulary object selection and point-cloud understanding tasks. In the revision we will add explicit in-text references to these tables/figures, include error bars and propagation analysis, and expand the results section to make all supporting numbers immediately verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity detected; the method is an algorithmic extension of existing 3DGS and contrastive learning.

full rationale

The paper presents TreeGaussian as a new tree-guided cascaded contrastive framework built on 3D Gaussian Splatting and standard contrastive learning. The abstract and description outline construction of a multi-level object tree from 2D priors, a two-stage cascaded contrastive strategy, CSD mechanism, and graph denoising without any equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to inputs by construction. No self-definitional loops, uniqueness theorems imported from authors, or ansatzes smuggled via citation appear in the provided text. The derivation chain consists of independent structural additions whose effectiveness is claimed to be shown via experiments, making the paper self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumption that 2D image priors can supply usable hierarchical labels and that contrastive saturation can be avoided by staging the loss; no free parameters or invented physical entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Hierarchical semantic structures in 3D scenes can be represented by a multi-level object tree derived from 2D priors
    Invoked when the paper states that constructing the tree enables structured learning across object-part hierarchies.
  • domain assumption Cascaded global-to-local contrastive learning mitigates saturation and stabilizes training
    Stated as the rationale for the two-stage strategy.
invented entities (2)
  • Consistent Segmentation Detection (CSD) mechanism no independent evidence
    purpose: Align segmentation modes across views
    New module introduced to enforce cross-view consistency; no independent evidence outside the method is provided.
  • Graph-based denoising module no independent evidence
    purpose: Suppress unstable Gaussian points
    New component for cleaning the representation; no external validation mentioned.
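Since the paper gives no independent evidence for the denoising module, here is one plausible reading of "graph-based denoising" as a nearest-neighbor outlier filter over Gaussian centers within a cluster, dropping distant clutter. The k-NN construction, `k`, and `z_thresh` are editorial assumptions, not the authors' exact design.

```python
import numpy as np

def denoise_cluster(points, k=3, z_thresh=2.0):
    """Flag Gaussian centers that sit far from the rest of their cluster.

    Each point's score is its mean distance to its k nearest neighbors;
    points whose score is an outlier (z-score above z_thresh) are dropped.
    Returns a boolean keep-mask.
    """
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                       # ignore self-distance
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)  # mean distance to k NNs
    z = (knn_mean - knn_mean.mean()) / (knn_mean.std() + 1e-8)
    return z < z_thresh
```

Applied per language-mapped instance cluster, such a filter would suppress the distant clutter that Figure 7's ablation attributes to the denoising module.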

pith-pipeline@v0.9.0 · 5540 in / 1536 out tokens · 61651 ms · 2026-05-14T00:13:57.848329+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

  1. [1] Aliev, K.A., Sevastopolsky, A., Kolos, M., Ulyanov, D., Lempitsky, V.: Neural point-based graphics. In: European Conference on Computer Vision. pp. 696–712. Springer (2020)
  2. [2] Bhalgat, Y., Laina, I., Henriques, J.F., Zisserman, A., Vedaldi, A.: N2F2: Hierarchical scene understanding with nested neural feature fields. In: European Conference on Computer Vision. pp. 197–214. Springer (2024)
  3. [4] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
  4. [5] Cen, J., Fang, J., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment any 3D Gaussians. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 1971–1979 (2025)
  5. [6] Cen, J., Zhou, Z., Fang, J., Shen, W., Xie, L., Jiang, D., Zhang, X., Tian, Q., et al.: Segment anything in 3D with NeRFs. Advances in Neural Information Processing Systems 36, 25971–25990 (2023)
  6. [7] Choi, S., Song, H., Kim, J., Kim, T., Do, H.: Click-Gaussian: Interactive segmentation to any 3D Gaussians. In: European Conference on Computer Vision. pp. 289–305. Springer (2024)
  7. [8] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5828–5839 (2017)
  8. [9] Foley, J.D.: Computer Graphics: Principles and Practice, vol. 12110. Addison-Wesley Professional (1996)
  9. [11] Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: Radiance fields without neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5501–5510 (2022)
  10. [12] Gottschalk, S.A.: Collision queries using oriented bounding boxes. The University of North Carolina at Chapel Hill (2000)
  11. [13] Guadarrama, S., Rodner, E., Saenko, K., Zhang, N., Farrell, R., Donahue, J., Darrell, T.: Open-vocabulary object retrieval. In: Robotics: Science and Systems. vol. 2, p. 6 (2014)
  12. [14] Han, L., Zheng, T., Xu, L., Fang, L.: OccuSeg: Occupancy-aware 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2940–2949 (2020)
  13. [15] Humblot-Renaux, G., Marchegiani, L., Moeslund, T.B., Gade, R.: Navigation-oriented scene understanding for robotic autonomy: Learning to segment driveability in egocentric images. IEEE Robotics and Automation Letters 7(2), 2913–2920 (2022)
  14. [16] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139:1–139:14 (2023)
  15. [17] Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: LERF: Language embedded radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19729–19739 (2023)
  16. [18] Kim, C.M., Wu, M., Kerr, J., Goldberg, K., Tancik, M., Kanazawa, A.: GARField: Group anything with radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21530–21539 (2024)
  17. [19] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)
  18. [20] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., et al.: Matryoshka representation learning. Advances in Neural Information Processing Systems 35, 30233–30249 (2022)
  19. [21] Lassner, C., Zollhofer, M.: Pulsar: Efficient sphere-based neural rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1440–1449 (2021)
  20. [22] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)
  21. [23] Li, H., Wu, Y., Meng, J., Gao, Q., Zhang, Z., Wang, R., Zhang, J.: InstanceGaussian: Appearance-semantic joint Gaussian representation for 3D instance-level perception. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14078–14088 (2025)
  22. [24] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)
  23. [25] Li, Y., Pathak, D.: Object-aware Gaussian splatting for robotic manipulation. In: ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation (2024)
  24. [26] Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel CNN for efficient 3D deep learning. Advances in Neural Information Processing Systems 32 (2019)
  25. [27] MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. vol. 5, pp. 281–298. University of California Press (1967)
  26. [28] McCormac, J., Handa, A., Davison, A., Leutenegger, S.: SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). pp. 4628–4635. IEEE (2017)
  27. [29] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  28. [30] Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33(5), 1255–1262 (2017)
  29. [31] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  30. [32] Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: OpenScene: 3D scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 815–824 (2023)
  31. [33] Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: LangSplat: 3D language Gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20051–20060 (2024)
  32. [34] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  33. [35] Saxena, A., Driemeyer, J., Ng, A.Y.: Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27(2), 157–173 (2008)
  34. [36] Schult, J., Engelmann, F., Kontogianni, T., Leibe, B.: DualConvMesh-Net: Joint geodesic and Euclidean convolutions on 3D meshes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8612–8622 (2020)
  35. [37] Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3D Gaussians for open-vocabulary scene understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5333–5343 (2024)
  36. [38] Valentin, J.P., Sengupta, S., Warrell, J., Shahrokni, A., Torr, P.H.: Mesh based semantic modelling for indoor and outdoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2067–2074 (2013)
  37. [39] Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning. pp. 9929–9939. PMLR (2020)
  38. [40] Wu, Y., Meng, J., Li, H., Wu, C., Shi, Y., Cheng, X., Zhao, C., Feng, H., Ding, E., Wang, J., Zhang, J.: OpenGaussian: Towards point-level 3D Gaussian-based open vocabulary understanding. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 19114–…
  39. [41] Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., Trigoni, N.: Learning object bounding boxes for 3D instance segmentation on point clouds. Advances in Neural Information Processing Systems 32 (2019)
  40. [42] Yang, R., Zhu, Z., Jiang, Z., Ye, B., Chen, X., Zhang, Y., Chen, Y., Zhao, J., Zhao, H.: Spectrally pruned Gaussian fields with neural compensation. arXiv preprint arXiv:2405.00676 (2024)
  41. [43] Yi, L., Zhao, W., Wang, H., Sung, M., Guibas, L.J.: GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3947–3956 (2019)
  42. [44] Ying, H., Yin, Y., Zhang, J., Wang, F., Yu, T., Huang, R., Fang, L.: OmniSeg3D: Omniversal 3D segmentation via hierarchical contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20612–20622 (2024)
  43. [45] Zheng, Y., Chen, X., Zheng, Y., Gu, S., Yang, R., Jin, B., Li, P., Zhong, C., Wang, Z., Liu, L., et al.: GaussianGrasper: 3D language Gaussian splatting for open-vocabulary robotic grasping. IEEE Robotics and Automation Letters (2024)
  44. [46] Zhou, S., Chang, H., Jiang, S., Fan, Z., Zhu, Z., Xu, D., Chari, P., You, S., Wang, Z., Kadambi, A.: Feature 3DGS: Supercharging 3D Gaussian splatting to enable distilled feature fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21676–21685 (2024)