pith. sign in

arxiv: 2606.29786 · v1 · pith:ICPWQVACnew · submitted 2026-06-29 · 💻 cs.CV · cs.RO

OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments

Pith reviewed 2026-06-30 06:20 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords 3D scene graphsopen-vocabularypart-awarescene understandingrobotics perceptiongraph generationaffordancesUniGraph3D
0
0 comments X

The pith

OP3DSG builds unified open-vocabulary 3D scene graphs that model objects, interactive parts, relations, and affordances together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OP3DSG as a framework that moves beyond object-centric 3D scene graphs to include part-level details and multiple relation types. This matters for real-world robotics because tasks often require understanding how parts interact and what actions they afford. The method combines knowledge-guided detection and 3D fusion to keep small components intact, then uses a geometry prior graph refined by language models to cut down on incorrect relations while keeping construction fast. A new benchmark called UniGraph3D supports testing of part-aware and multi-level reasoning. Results show the approach reaches state-of-the-art accuracy and works as a backbone for robot perception tasks.

Core claim

OP3DSG is an open-vocabulary part-aware 3D scene graph generation framework that constructs unified graphs jointly modeling objects, interactive parts, spatial relations, functional relations, and affordances. It integrates object-part knowledge-guided detection with part-aware 3D fusion to preserve small and interaction-relevant components, and employs a geometry-initialized prior graph with LLM-based refinement to reduce spurious relational predictions while enabling efficient graph construction.

What carries the argument

Geometry-initialized prior graph with LLM-based refinement, which reduces spurious relational predictions while enabling efficient construction of part-aware graphs.

If this is right

  • Achieves state-of-the-art performance on unified 3D scene graph construction.
  • Supports systematic evaluation of part-aware perception and multi-level relational reasoning via the UniGraph3D benchmark.
  • Functions effectively as a perception backbone across diverse real-world robotics tasks.
  • Enables fine-grained understanding of environments that require part-level and affordance information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Part-level modeling could support more precise robot manipulation by identifying which components of an object to grasp or act on.
  • The open-vocabulary design may let the same graphs handle novel objects without requiring new labeled training data.
  • Including functional relations and affordances directly in the graph could simplify higher-level task planning for robots.

Load-bearing premise

The geometry-initialized prior graph combined with LLM-based refinement reduces spurious relational predictions while preserving accuracy and enabling efficient construction.

What would settle it

A controlled test in which graphs built without the prior graph and LLM refinement achieve equal or higher relational accuracy and downstream robotics task success rates than OP3DSG graphs.

Figures

Figures reproduced from arXiv: 2606.29786 by Ue-Hwan Kim, Yirum Kim.

Figure 1
Figure 1. Figure 1: Conceptual Comparison of 3D Scene Graphs (3DSGs). [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Preliminary experiments reveal limitations in part detection. (a) With CLIP [33], part-centric phrases (e.g., “a chair leg”, “a leg of chair”) yield activa￾tion maps nearly identical to the object￾level prompt “a chair”, as attention concen￾trates on the token chair. (b) With Ground￾ingDINO [27], part-centric phrases still yield object-level bounding boxes; when a part is detected, its confidence score is … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed framework. Knowledge-guided 2D detection generates object and part instances that are incrementally fused into a global 3D map. The resulting nodes define a geometry-initialized prior graph, which is refined by LLM￾based reasoning into the Unified 3DSG. 3 Problem Formulation Given a posed RGB-D sequence {It, Dt} T t=0, where It denotes the RGB frame and Dt the corresponding depth, … view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results. We visualize the unified 3DSG in partial regions of two scenes. We visualize portions of the unified 3DSG in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Robot Applications using OP3DSG. The generated 3DSGs can be used for various downstream tasks, including question answering, task planning, language￾conditioned querying, and navigation. See the supplement for implementation details. 6 Real-world Robot Applications We demonstrate that OP3DSG directly supports diverse real-world embodied tasks ( [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
read the original abstract

3D scene graphs (3DSGs) provide a compact and structured abstraction of 3D environments. Although advances in foundation models have enabled open-vocabulary 3DSG generation, existing approaches remain object-centric and encode limited relational information -- restricting their applicability in real-world scenarios that require fine-grained understanding. We propose OP3DSG, an open-vocabulary part-aware 3DSG generation framework that constructs unified graphs that jointly model objects, interactive parts, spatial relations, functional relations, and affordances. OP3DSG integrates object-part knowledge-guided detection with part-aware 3D fusion to preserve small and interaction-relevant components, and employs a geometry-initialized prior graph with LLM-based refinement to reduce spurious relational predictions while enabling efficient graph construction. To systematically evaluate unified 3D scene graph construction, we introduce UniGraph3D, a benchmark designed for part-aware perception and multi-level relational reasoning. Experimental results show that OP3DSG achieves state-of-the-art performance and demonstrates its effectiveness as a perception backbone in diverse real-world robotics tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes OP3DSG, an open-vocabulary part-aware 3D scene graph generation framework for real-world environments. It constructs unified graphs modeling objects, interactive parts, spatial relations, functional relations, and affordances by integrating object-part knowledge-guided detection, part-aware 3D fusion, and a geometry-initialized prior graph with LLM-based refinement. The paper introduces the UniGraph3D benchmark for part-aware perception and multi-level relational reasoning, and claims that OP3DSG achieves state-of-the-art performance while being effective as a perception backbone in diverse real-world robotics tasks.

Significance. If the experimental claims hold, this work would be significant as it extends 3D scene graphs beyond object-centric approaches to include part-level details and affordances, potentially improving structured perception for robotics applications in complex environments.

major comments (1)
  1. Abstract: The central claims that OP3DSG achieves state-of-the-art performance on UniGraph3D and demonstrates effectiveness as a perception backbone in robotics tasks are unsupported by any quantitative metrics, ablation studies, benchmark construction details, LLM prompt examples, or evaluation results in the provided manuscript text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and recommendation. Below we address the major comment point by point.

read point-by-point responses
  1. Referee: [—] Abstract: The central claims that OP3DSG achieves state-of-the-art performance on UniGraph3D and demonstrates effectiveness as a perception backbone in robotics tasks are unsupported by any quantitative metrics, ablation studies, benchmark construction details, LLM prompt examples, or evaluation results in the provided manuscript text.

    Authors: The text supplied for review consists solely of the abstract. The complete manuscript contains dedicated sections providing exactly these elements: Section 4 details UniGraph3D benchmark construction and statistics; Section 5 reports quantitative metrics, SOTA comparisons on part-aware detection and relational reasoning, and ablation studies; Section 6 presents robotics-task evaluations with quantitative results; and the appendix supplies LLM prompt templates together with additional evaluation tables. The abstract therefore summarizes results that are fully supported in the body of the paper. If the full manuscript was not accessible during review, we can supply the complete PDF or specific excerpts. revision: no

Circularity Check

0 steps flagged

No circularity; pipeline description contains no self-referential derivations or fitted predictions

full rationale

The supplied abstract and method summary describe a modular pipeline (object-part knowledge-guided detection, part-aware 3D fusion, geometry-initialized prior graph + LLM refinement) without any equations, parameter-fitting steps, or predictions that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known empirical patterns are renamed as novel derivations. The SOTA and robotics-backbone claims rest on experimental results rather than on any closed derivation loop, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5721 in / 1080 out tokens · 30714 ms · 2026-06-30T06:20:29.191338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. ArXiv abs/2303.08774(2023)

  2. [2]

    In: Conference on Robot Learning (2022)

    Agia, C., Jatavallabhula, K.M., Khodeir, M.N.M., Mikvs’ik, O., Vineet, V., Mukadam, M., Paull, L., Shkurti, F.: Taskography: Evaluating robot task plan- ning over large 3d scene graphs. In: Conference on Robot Learning (2022)

  3. [3]

    Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: Int. Conf. Comput. Vis. pp. 5664–5673 (2019)

  4. [4]

    In: Conference on Robot Learning

    Chang, H., Boyalakuntla, K., Lu, S., Cai, S., Jing, E., Keskar, S., Geng, S., Ab- bas, A., Zhou, L., Bekris, K., et al.: Context-aware entity grounding with open- vocabulary 3d scene graphs. In: Conference on Robot Learning. pp. 1950–1974 (2023)

  5. [5]

    In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS)

    Chang, Y., Hughes, N., Ray, A., Carlone, L.: Hydra-multi: Collaborative online construction of 3d scene graphs with multi-robot teams. In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). pp. 10995–11002. IEEE (2023) 16 Y. Kim and UH. Kim

  6. [6]

    Chen, H., Yang, J., Vondrick, C., Mao, C.: Invite: Interpret and control vision- language models with text explanations. In: Int. Conf. Learn. Represent. (2024)

  7. [7]

    In: Scandinavian conference on image analysis

    Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Coloring channel represen- tations for visual tracking. In: Scandinavian conference on image analysis. pp. 117–129. Springer (2015)

  8. [8]

    In: IEEE Conf

    Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14531–14542 (2024)

  9. [9]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. ArXivabs/1810.04805(2018)

  10. [10]

    In: IEEE Conf

    Everts, I., Van Gemert, J.C., Gevers, T.: Evaluation of color stips for human action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 2850–2857 (2013)

  11. [11]

    In: IEEE Conf

    Feng, M., Hou, H., Zhang, L., Wu, Z., Guo, Y., Mian, A.: 3d spatial multimodal knowledge accumulation for scene graph prediction in point cloud. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9182–9191 (2023)

  12. [12]

    In: IEEE Conf

    Gadre, S.Y., Ehsani, K., Song, S., Mottaghi, R.: Continuous scene representations for embodied ai. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14849–14859 (2022)

  13. [13]

    In: IEEE Conf

    Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: Gapartnet: Cross-category domain-generalizable object perception and manipulation via gen- eralizable and actionable parts. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7081–7091 (2023)

  14. [14]

    In: IEEE International Conference on Robotics and Automation (ICRA)

    Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028 (2024)

  15. [15]

    In: IEEE Conf

    Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5356–5364 (2019)

  16. [16]

    He, J., Yang, S., Yang, S., Kortylewski, A., Yuan, X., Chen, J.N., Liu, S., Yang, C., Yu, Q., Yuille, A.: Partimagenet: A large, high-quality dataset of parts. In: Eur. Conf. Comput. Vis. pp. 128–145. Springer (2022)

  17. [17]

    Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: Fross: Faster-than-real- time online 3d semantic scene graph generation from rgb-d images. In: Int. Conf. Comput. Vis. pp. 28818–28827 (2025)

  18. [18]

    In: Robotics: Science and Systems XVIII (2022)

    Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception sys- tem for 3d scene graph construction and optimization. In: Robotics: Science and Systems XVIII (2022)

  19. [19]

    Kim, U.H.,Park, J.M.,Song,T.J.,Kim, J.H.:3-dscenegraph: Asparseandseman- ticrepresentationofphysicalenvironmentsforintelligentagents.IEEEtransactions on cybernetics50(12), 4921–4933 (2019)

  20. [20]

    In: IEEE Conf

    Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14183– 14193 (2024)

  21. [21]

    In: IEEE Conf

    Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10965–10975 (2022)

  22. [22]

    Li, X., Xu, S., Yang, Y., Cheng, G., Tong, Y., Tao, D.: Panoptic-partformer: Learn- ing a unified model for panoptic part segmentation. In: Eur. Conf. Comput. Vis. pp. 729–747. Springer (2022) OP3DSG 17

  23. [23]

    Intelligent Service Robotics15(4), 459–473 (2022)

    Li, Y., Ma, Y., Huo, X., Wu, X.: Remote object navigation for service robots using hierarchical knowledge graph in human-centered environments. Intelligent Service Robotics15(4), 459–473 (2022)

  24. [24]

    IEEE Trans

    Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process.24(12), 5630–5644 (2015)

  25. [25]

    Lin, C., Sun, P., Jiang, Y., Luo, P., Qu, L., Haffari, G., Yuan, Z., Cai, J.: Learning object-language alignments for open-vocabulary object detection. In: Int. Conf. Learn. Represent. (2023)

  26. [26]

    In: IEEE Conf

    Liu, M., Zhu, Y., Cai, H., Han, S., Ling, Z., Porikli, F.M., Su, H.: Partslip: Low- shot part segmentation for 3d point clouds via pretrained image-language models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21736–21746 (2023)

  27. [27]

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Eur. Conf. Comput. Vis. pp. 38–55. Springer (2024)

  28. [28]

    Liu, Y., Zhang, X., Zhang, S., He, X.: Part-aware prototype network for few-shot semantic segmentation. In: Eur. Conf. Comput. Vis. (2020)

  29. [29]

    Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3d. In: Int. Conf. Comput. Vis. pp. 7818–7827 (2025)

  30. [30]

    ArXiv abs/2004.07944(2020)

    Meletis, P., Wen, X., Lu, C., De Geus, D., Dubbelman, G.: Cityscapes- panoptic-parts and pascal-panoptic-parts datasets for scene understanding. ArXiv abs/2004.07944(2020)

  31. [31]

    Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovit- skiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open- vocabulary object detection. In: Eur. Conf. Comput. Vis. pp. 728–755. Springer (2022)

  32. [32]

    In: IEEE Conf

    Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Part- net: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2019)

  33. [33]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Int. Conf. Mach. Learn. pp. 8748–8763 (2021)

  34. [34]

    In: IEEE Conf

    Ramanathan, V., Kalia, A., Petrovic, V., Wen, Y., Zheng, B., Guo, B., Wang, R., Marquez, A., Kovvuri, R., Kadian, A., et al.: Paco: Parts and attributes of common objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7141–7151 (2023)

  35. [35]

    In: Conference on Robot Learning (2023)

    Rana, K., Haviland, J., Garg, S., Abou-Chakra, J., Reid, I.D., Sünderhauf, N.: Sayplan: Grounding large language models using 3d scene graphs for scalable task planning. In: Conference on Robot Learning (2023)

  36. [36]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. ArXivabs/1908.10084(2019)

  37. [37]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. ArXivabs/2401.14159(2024)

  38. [38]

    The International Journal of Robotics Research40, 1510 – 1546 (2021)

    Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40, 1510 – 1546 (2021)

  39. [39]

    In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS)

    Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction. In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). pp. 4083–4090. IEEE (2025) 18 Y. Kim and UH. Kim

  40. [40]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. ArXiv abs/2601.03267(2025)

  41. [41]

    Sun, P., Chen, S., Zhu, C., Xiao, F., Luo, P., Xie, S., Yan, Z.: Going denser with open-vocabulary part segmentation. In: Int. Conf. Comput. Vis. pp. 15453–15465 (2023)

  42. [42]

    International journal of computer vi- sion7(1), 11–32 (1991)

    Swain, M.J., Ballard, D.H.: Color indexing. International journal of computer vi- sion7(1), 11–32 (1991)

  43. [43]

    Szymańska, E., Dusmanu, M., Buurlage, J.W., Rad, M., Pollefeys, M.: Space3d- bench: Spatial 3d question answering benchmark. In: Eur. Conf. Comput. Vis. pp. 68–85. Springer (2024)

  44. [44]

    IEEE Trans

    Van De Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell.32(9), 1582–1596 (2009)

  45. [45]

    IEEE Trans

    Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Trans. Image Process.18(7), 1512–1523 (2009)

  46. [46]

    In: IEEE Conf

    Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3961–3970 (2020)

  47. [47]

    IEEE Trans

    Wang, J., Chakraborty, R., Yu, S.X.: Transformer for 3d point clouds. IEEE Trans. Pattern Anal. Mach. Intell.44, 4419–4431 (2019)

  48. [48]

    Wang, L., Li, X., Fang, Y.: Few-shot learning of part-specific probability space for 3dshapesegmentation.In:IEEEConf.Comput.Vis.PatternRecog.pp.4504–4513 (2020)

  49. [49]

    Wang, P., Shen, X., Lin, Z.L., Cohen, S.D., Price, B.L., Yuille, A.L.: Joint object and part segmentation using deep learned potentials. In: Int. Conf. Comput. Vis. pp. 1573–1581 (2015)

  50. [50]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Adv. Neural Inform. Process. Syst. vol. 35, pp. 24824–24837 (2022)

  51. [51]

    Wei, M., Yue, X., Zhang, W., Kong, S., Liu, X., Pang, J.: Ov-parts: Towards open- vocabulary part segmentation. In: Adv. Neural Inform. Process. Syst. vol. 36, pp. 70094–70114 (2023)

  52. [52]

    In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024)

    Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024)

  53. [53]

    In: IEEE International Conference on Robotics and Au- tomation (ICRA) (2026)

    Werby, A., Rotondi, D., Scaparro, F., Arras, K.O.: Keysg: Hierarchical keyframe- based 3d scene graphs. In: IEEE International Conference on Robotics and Au- tomation (ICRA) (2026)

  54. [54]

    In: IEEE Conf

    Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5064–5074 (2023)

  55. [55]

    In: IEEE Conf

    Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: Incre- mental 3d scene graph prediction from rgb-d sequences. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7515–7525 (2021)

  56. [56]

    In: IEEE Conf

    Wu, S., Zhang, W., Jin, S., Liu, W., Loy, C.C.: Aligning bag of regions for open- vocabulary object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 15254–15264 (2023)

  57. [57]

    Sampart3d: Segment any part in 3d objects.arXiv preprint arXiv:2411.07184,

    Yang, Y.N., Huang, Y., Guo, Y.C., Lu, L., Wu, X., Lam, E.Y., Cao, Y.P., Liu, X.: Sampart3d: Segment any part in 3d objects. ArXivabs/2411.07184(2024) OP3DSG 19

  58. [58]

    In: IEEE Conf

    Yu, F., Liu, K., Zhang, Y., Zhu, C., Xu, K.: Partnet: A recursive part decomposi- tion network for fine-grained and hierarchical shape segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9483–9492 (2019)

  59. [59]

    In: IEEE Conf

    Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 19401–19413 (2025)

  60. [60]

    In: IEEE Conf

    Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1724–1732 (2024)

  61. [61]

    In: IEEE Conf

    Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: Regionclip: Region-based language-image pretraining. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 16793–16803 (2022)

  62. [62]

    International journal of computer vision127(3), 302–321 (2019)

    Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Se- mantic understanding of scenes through the ade20k dataset. International journal of computer vision127(3), 302–321 (2019)

  63. [63]

    Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty- thousand classes using image-level supervision. In: Eur. Conf. Comput. Vis. pp. 350–368. Springer (2022)

  64. [64]

    ArXivabs/2312.03015(2023)

    Zhou, Y., Gu, J., Li, X., Liu, M., Fang, Y., Su, H.: Partslip++: Enhancing low- shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. ArXivabs/2312.03015(2023)