OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments

Ue-Hwan Kim; Yirum Kim

arxiv: 2606.29786 · v1 · pith:ICPWQVACnew · submitted 2026-06-29 · 💻 cs.CV · cs.RO

OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments

Yirum Kim , Ue-Hwan Kim This is my paper

Pith reviewed 2026-06-30 06:20 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords 3D scene graphsopen-vocabularypart-awarescene understandingrobotics perceptiongraph generationaffordancesUniGraph3D

0 comments

The pith

OP3DSG builds unified open-vocabulary 3D scene graphs that model objects, interactive parts, relations, and affordances together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OP3DSG as a framework that moves beyond object-centric 3D scene graphs to include part-level details and multiple relation types. This matters for real-world robotics because tasks often require understanding how parts interact and what actions they afford. The method combines knowledge-guided detection and 3D fusion to keep small components intact, then uses a geometry prior graph refined by language models to cut down on incorrect relations while keeping construction fast. A new benchmark called UniGraph3D supports testing of part-aware and multi-level reasoning. Results show the approach reaches state-of-the-art accuracy and works as a backbone for robot perception tasks.

Core claim

OP3DSG is an open-vocabulary part-aware 3D scene graph generation framework that constructs unified graphs jointly modeling objects, interactive parts, spatial relations, functional relations, and affordances. It integrates object-part knowledge-guided detection with part-aware 3D fusion to preserve small and interaction-relevant components, and employs a geometry-initialized prior graph with LLM-based refinement to reduce spurious relational predictions while enabling efficient graph construction.

What carries the argument

Geometry-initialized prior graph with LLM-based refinement, which reduces spurious relational predictions while enabling efficient construction of part-aware graphs.

If this is right

Achieves state-of-the-art performance on unified 3D scene graph construction.
Supports systematic evaluation of part-aware perception and multi-level relational reasoning via the UniGraph3D benchmark.
Functions effectively as a perception backbone across diverse real-world robotics tasks.
Enables fine-grained understanding of environments that require part-level and affordance information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Part-level modeling could support more precise robot manipulation by identifying which components of an object to grasp or act on.
The open-vocabulary design may let the same graphs handle novel objects without requiring new labeled training data.
Including functional relations and affordances directly in the graph could simplify higher-level task planning for robots.

Load-bearing premise

The geometry-initialized prior graph combined with LLM-based refinement reduces spurious relational predictions while preserving accuracy and enabling efficient construction.

What would settle it

A controlled test in which graphs built without the prior graph and LLM refinement achieve equal or higher relational accuracy and downstream robotics task success rates than OP3DSG graphs.

Figures

Figures reproduced from arXiv: 2606.29786 by Ue-Hwan Kim, Yirum Kim.

**Figure 2.** Figure 2: Preliminary experiments reveal limitations in part detection. (a) With CLIP [33], part-centric phrases (e.g., “a chair leg”, “a leg of chair”) yield activation maps nearly identical to the objectlevel prompt “a chair”, as attention concentrates on the token chair. (b) With GroundingDINO [27], part-centric phrases still yield object-level bounding boxes; when a part is detected, its confidence score is … view at source ↗

**Figure 3.** Figure 3: Overview of the proposed framework. Knowledge-guided 2D detection generates object and part instances that are incrementally fused into a global 3D map. The resulting nodes define a geometry-initialized prior graph, which is refined by LLMbased reasoning into the Unified 3DSG. 3 Problem Formulation Given a posed RGB-D sequence {It, Dt} T t=0, where It denotes the RGB frame and Dt the corresponding depth, … view at source ↗

**Figure 4.** Figure 4: Qualitative results. We visualize the unified 3DSG in partial regions of two scenes. We visualize portions of the unified 3DSG in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Robot Applications using OP3DSG. The generated 3DSGs can be used for various downstream tasks, including question answering, task planning, languageconditioned querying, and navigation. See the supplement for implementation details. 6 Real-world Robot Applications We demonstrate that OP3DSG directly supports diverse real-world embodied tasks ( [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

read the original abstract

3D scene graphs (3DSGs) provide a compact and structured abstraction of 3D environments. Although advances in foundation models have enabled open-vocabulary 3DSG generation, existing approaches remain object-centric and encode limited relational information -- restricting their applicability in real-world scenarios that require fine-grained understanding. We propose OP3DSG, an open-vocabulary part-aware 3DSG generation framework that constructs unified graphs that jointly model objects, interactive parts, spatial relations, functional relations, and affordances. OP3DSG integrates object-part knowledge-guided detection with part-aware 3D fusion to preserve small and interaction-relevant components, and employs a geometry-initialized prior graph with LLM-based refinement to reduce spurious relational predictions while enabling efficient graph construction. To systematically evaluate unified 3D scene graph construction, we introduce UniGraph3D, a benchmark designed for part-aware perception and multi-level relational reasoning. Experimental results show that OP3DSG achieves state-of-the-art performance and demonstrates its effectiveness as a perception backbone in diverse real-world robotics tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a part-aware 3D scene graph pipeline with a new benchmark, but its SOTA and robotics-effectiveness claims sit on experimental details that are not visible.

read the letter

The one thing to know is that OP3DSG targets object-centric limits in 3D scene graphs by adding part detection, affordances, and functional relations through object-part guided detection, part-aware fusion, and a geometry prior cleaned by LLM. The UniGraph3D benchmark is presented as the evaluation vehicle for this multi-level setup.

What is actually new is the specific combination of those components plus the benchmark itself. Prior work stayed at whole objects, so extending to interactive parts and using geometry to seed relations before LLM refinement is a reasonable step for robotics scenes where small components matter.

The soft spot is the evidence. The abstract states SOTA performance and usefulness as a robotics backbone, yet supplies no tables, ablations, quantitative robotics-task results, or even basic metrics. Without those, it is not possible to check whether the geometry prior plus LLM step actually cuts spurious edges while keeping accuracy, or whether the fusion step preserves small parts as claimed. The stress-test note is accurate on this point: the central assertions cannot be tested from the supplied text.

This is for people building perception stacks for real-world robots who already work with scene graphs and want finer granularity. A reader could pull the benchmark for their own tests even if the method details need more scrutiny.

If the full manuscript contains the missing experiments and they are reproducible, it is worth sending to referees. Right now the writeup does not give enough to judge whether the approach delivers on its own terms.

Referee Report

1 major / 0 minor

Summary. The paper proposes OP3DSG, an open-vocabulary part-aware 3D scene graph generation framework for real-world environments. It constructs unified graphs modeling objects, interactive parts, spatial relations, functional relations, and affordances by integrating object-part knowledge-guided detection, part-aware 3D fusion, and a geometry-initialized prior graph with LLM-based refinement. The paper introduces the UniGraph3D benchmark for part-aware perception and multi-level relational reasoning, and claims that OP3DSG achieves state-of-the-art performance while being effective as a perception backbone in diverse real-world robotics tasks.

Significance. If the experimental claims hold, this work would be significant as it extends 3D scene graphs beyond object-centric approaches to include part-level details and affordances, potentially improving structured perception for robotics applications in complex environments.

major comments (1)

Abstract: The central claims that OP3DSG achieves state-of-the-art performance on UniGraph3D and demonstrates effectiveness as a perception backbone in robotics tasks are unsupported by any quantitative metrics, ablation studies, benchmark construction details, LLM prompt examples, or evaluation results in the provided manuscript text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and recommendation. Below we address the major comment point by point.

read point-by-point responses

Referee: [—] Abstract: The central claims that OP3DSG achieves state-of-the-art performance on UniGraph3D and demonstrates effectiveness as a perception backbone in robotics tasks are unsupported by any quantitative metrics, ablation studies, benchmark construction details, LLM prompt examples, or evaluation results in the provided manuscript text.

Authors: The text supplied for review consists solely of the abstract. The complete manuscript contains dedicated sections providing exactly these elements: Section 4 details UniGraph3D benchmark construction and statistics; Section 5 reports quantitative metrics, SOTA comparisons on part-aware detection and relational reasoning, and ablation studies; Section 6 presents robotics-task evaluations with quantitative results; and the appendix supplies LLM prompt templates together with additional evaluation tables. The abstract therefore summarizes results that are fully supported in the body of the paper. If the full manuscript was not accessible during review, we can supply the complete PDF or specific excerpts. revision: no

Circularity Check

0 steps flagged

No circularity; pipeline description contains no self-referential derivations or fitted predictions

full rationale

The supplied abstract and method summary describe a modular pipeline (object-part knowledge-guided detection, part-aware 3D fusion, geometry-initialized prior graph + LLM refinement) without any equations, parameter-fitting steps, or predictions that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known empirical patterns are renamed as novel derivations. The SOTA and robotics-backbone claims rest on experimental results rather than on any closed derivation loop, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5721 in / 1080 out tokens · 30714 ms · 2026-06-30T06:20:29.191338+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 8 canonical work pages · 5 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. ArXiv abs/2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

In: Conference on Robot Learning (2022)

Agia, C., Jatavallabhula, K.M., Khodeir, M.N.M., Mikvs’ik, O., Vineet, V., Mukadam, M., Paull, L., Shkurti, F.: Taskography: Evaluating robot task plan- ning over large 3d scene graphs. In: Conference on Robot Learning (2022)

2022
[3]

Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: Int. Conf. Comput. Vis. pp. 5664–5673 (2019)

2019
[4]

In: Conference on Robot Learning

Chang, H., Boyalakuntla, K., Lu, S., Cai, S., Jing, E., Keskar, S., Geng, S., Ab- bas, A., Zhou, L., Bekris, K., et al.: Context-aware entity grounding with open- vocabulary 3d scene graphs. In: Conference on Robot Learning. pp. 1950–1974 (2023)

1950
[5]

In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS)

Chang, Y., Hughes, N., Ray, A., Carlone, L.: Hydra-multi: Collaborative online construction of 3d scene graphs with multi-robot teams. In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). pp. 10995–11002. IEEE (2023) 16 Y. Kim and UH. Kim

2023
[6]

Chen, H., Yang, J., Vondrick, C., Mao, C.: Invite: Interpret and control vision- language models with text explanations. In: Int. Conf. Learn. Represent. (2024)

2024
[7]

In: Scandinavian conference on image analysis

Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Coloring channel represen- tations for visual tracking. In: Scandinavian conference on image analysis. pp. 117–129. Springer (2015)

2015
[8]

In: IEEE Conf

Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14531–14542 (2024)

2024
[9]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. ArXivabs/1810.04805(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

In: IEEE Conf

Everts, I., Van Gemert, J.C., Gevers, T.: Evaluation of color stips for human action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 2850–2857 (2013)

2013
[11]

In: IEEE Conf

Feng, M., Hou, H., Zhang, L., Wu, Z., Guo, Y., Mian, A.: 3d spatial multimodal knowledge accumulation for scene graph prediction in point cloud. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9182–9191 (2023)

2023
[12]

In: IEEE Conf

Gadre, S.Y., Ehsani, K., Song, S., Mottaghi, R.: Continuous scene representations for embodied ai. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14849–14859 (2022)

2022
[13]

In: IEEE Conf

Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: Gapartnet: Cross-category domain-generalizable object perception and manipulation via gen- eralizable and actionable parts. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7081–7091 (2023)

2023
[14]

In: IEEE International Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028 (2024)

2024
[15]

In: IEEE Conf

Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5356–5364 (2019)

2019
[16]

He, J., Yang, S., Yang, S., Kortylewski, A., Yuan, X., Chen, J.N., Liu, S., Yang, C., Yu, Q., Yuille, A.: Partimagenet: A large, high-quality dataset of parts. In: Eur. Conf. Comput. Vis. pp. 128–145. Springer (2022)

2022
[17]

Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: Fross: Faster-than-real- time online 3d semantic scene graph generation from rgb-d images. In: Int. Conf. Comput. Vis. pp. 28818–28827 (2025)

2025
[18]

In: Robotics: Science and Systems XVIII (2022)

Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception sys- tem for 3d scene graph construction and optimization. In: Robotics: Science and Systems XVIII (2022)

2022
[19]

Kim, U.H.,Park, J.M.,Song,T.J.,Kim, J.H.:3-dscenegraph: Asparseandseman- ticrepresentationofphysicalenvironmentsforintelligentagents.IEEEtransactions on cybernetics50(12), 4921–4933 (2019)

2019
[20]

In: IEEE Conf

Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14183– 14193 (2024)

2024
[21]

In: IEEE Conf

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10965–10975 (2022)

2022
[22]

Li, X., Xu, S., Yang, Y., Cheng, G., Tong, Y., Tao, D.: Panoptic-partformer: Learn- ing a unified model for panoptic part segmentation. In: Eur. Conf. Comput. Vis. pp. 729–747. Springer (2022) OP3DSG 17

2022
[23]

Intelligent Service Robotics15(4), 459–473 (2022)

Li, Y., Ma, Y., Huo, X., Wu, X.: Remote object navigation for service robots using hierarchical knowledge graph in human-centered environments. Intelligent Service Robotics15(4), 459–473 (2022)

2022
[24]

IEEE Trans

Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process.24(12), 5630–5644 (2015)

2015
[25]

Lin, C., Sun, P., Jiang, Y., Luo, P., Qu, L., Haffari, G., Yuan, Z., Cai, J.: Learning object-language alignments for open-vocabulary object detection. In: Int. Conf. Learn. Represent. (2023)

2023
[26]

In: IEEE Conf

Liu, M., Zhu, Y., Cai, H., Han, S., Ling, Z., Porikli, F.M., Su, H.: Partslip: Low- shot part segmentation for 3d point clouds via pretrained image-language models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21736–21746 (2023)

2023
[27]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Eur. Conf. Comput. Vis. pp. 38–55. Springer (2024)

2024
[28]

Liu, Y., Zhang, X., Zhang, S., He, X.: Part-aware prototype network for few-shot semantic segmentation. In: Eur. Conf. Comput. Vis. (2020)

2020
[29]

Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3d. In: Int. Conf. Comput. Vis. pp. 7818–7827 (2025)

2025
[30]

ArXiv abs/2004.07944(2020)

Meletis, P., Wen, X., Lu, C., De Geus, D., Dubbelman, G.: Cityscapes- panoptic-parts and pascal-panoptic-parts datasets for scene understanding. ArXiv abs/2004.07944(2020)

work page arXiv 2004
[31]

Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovit- skiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open- vocabulary object detection. In: Eur. Conf. Comput. Vis. pp. 728–755. Springer (2022)

2022
[32]

In: IEEE Conf

Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Part- net: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2019)

2019
[33]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Int. Conf. Mach. Learn. pp. 8748–8763 (2021)

2021
[34]

In: IEEE Conf

Ramanathan, V., Kalia, A., Petrovic, V., Wen, Y., Zheng, B., Guo, B., Wang, R., Marquez, A., Kovvuri, R., Kadian, A., et al.: Paco: Parts and attributes of common objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7141–7151 (2023)

2023
[35]

In: Conference on Robot Learning (2023)

Rana, K., Haviland, J., Garg, S., Abou-Chakra, J., Reid, I.D., Sünderhauf, N.: Sayplan: Grounding large language models using 3d scene graphs for scalable task planning. In: Conference on Robot Learning (2023)

2023
[36]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. ArXivabs/1908.10084(2019)

work page internal anchor Pith review Pith/arXiv arXiv 1908
[37]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. ArXivabs/2401.14159(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

The International Journal of Robotics Research40, 1510 – 1546 (2021)

Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40, 1510 – 1546 (2021)

2021
[39]

In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS)

Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction. In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). pp. 4083–4090. IEEE (2025) 18 Y. Kim and UH. Kim

2025
[40]

OpenAI GPT-5 System Card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. ArXiv abs/2601.03267(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Sun, P., Chen, S., Zhu, C., Xiao, F., Luo, P., Xie, S., Yan, Z.: Going denser with open-vocabulary part segmentation. In: Int. Conf. Comput. Vis. pp. 15453–15465 (2023)

2023
[42]

International journal of computer vi- sion7(1), 11–32 (1991)

Swain, M.J., Ballard, D.H.: Color indexing. International journal of computer vi- sion7(1), 11–32 (1991)

1991
[43]

Szymańska, E., Dusmanu, M., Buurlage, J.W., Rad, M., Pollefeys, M.: Space3d- bench: Spatial 3d question answering benchmark. In: Eur. Conf. Comput. Vis. pp. 68–85. Springer (2024)

2024
[44]

IEEE Trans

Van De Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell.32(9), 1582–1596 (2009)

2009
[45]

IEEE Trans

Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Trans. Image Process.18(7), 1512–1523 (2009)

2009
[46]

In: IEEE Conf

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3961–3970 (2020)

2020
[47]

IEEE Trans

Wang, J., Chakraborty, R., Yu, S.X.: Transformer for 3d point clouds. IEEE Trans. Pattern Anal. Mach. Intell.44, 4419–4431 (2019)

2019
[48]

Wang, L., Li, X., Fang, Y.: Few-shot learning of part-specific probability space for 3dshapesegmentation.In:IEEEConf.Comput.Vis.PatternRecog.pp.4504–4513 (2020)

2020
[49]

Wang, P., Shen, X., Lin, Z.L., Cohen, S.D., Price, B.L., Yuille, A.L.: Joint object and part segmentation using deep learned potentials. In: Int. Conf. Comput. Vis. pp. 1573–1581 (2015)

2015
[50]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Adv. Neural Inform. Process. Syst. vol. 35, pp. 24824–24837 (2022)

2022
[51]

Wei, M., Yue, X., Zhang, W., Kong, S., Liu, X., Pang, J.: Ov-parts: Towards open- vocabulary part segmentation. In: Adv. Neural Inform. Process. Syst. vol. 36, pp. 70094–70114 (2023)

2023
[52]

In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024)

Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024)

2024
[53]

In: IEEE International Conference on Robotics and Au- tomation (ICRA) (2026)

Werby, A., Rotondi, D., Scaparro, F., Arras, K.O.: Keysg: Hierarchical keyframe- based 3d scene graphs. In: IEEE International Conference on Robotics and Au- tomation (ICRA) (2026)

2026
[54]

In: IEEE Conf

Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5064–5074 (2023)

2023
[55]

In: IEEE Conf

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: Incre- mental 3d scene graph prediction from rgb-d sequences. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7515–7525 (2021)

2021
[56]

In: IEEE Conf

Wu, S., Zhang, W., Jin, S., Liu, W., Loy, C.C.: Aligning bag of regions for open- vocabulary object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 15254–15264 (2023)

2023
[57]

Sampart3d: Segment any part in 3d objects.arXiv preprint arXiv:2411.07184,

Yang, Y.N., Huang, Y., Guo, Y.C., Lu, L., Wu, X., Lam, E.Y., Cao, Y.P., Liu, X.: Sampart3d: Segment any part in 3d objects. ArXivabs/2411.07184(2024) OP3DSG 19

work page arXiv 2024
[58]

In: IEEE Conf

Yu, F., Liu, K., Zhang, Y., Zhu, C., Xu, K.: Partnet: A recursive part decomposi- tion network for fine-grained and hierarchical shape segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9483–9492 (2019)

2019
[59]

In: IEEE Conf

Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 19401–19413 (2025)

2025
[60]

In: IEEE Conf

Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1724–1732 (2024)

2024
[61]

In: IEEE Conf

Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: Regionclip: Region-based language-image pretraining. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 16793–16803 (2022)

2022
[62]

International journal of computer vision127(3), 302–321 (2019)

Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Se- mantic understanding of scenes through the ade20k dataset. International journal of computer vision127(3), 302–321 (2019)

2019
[63]

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty- thousand classes using image-level supervision. In: Eur. Conf. Comput. Vis. pp. 350–368. Springer (2022)

2022
[64]

ArXivabs/2312.03015(2023)

Zhou, Y., Gu, J., Li, X., Liu, M., Fang, Y., Su, H.: Partslip++: Enhancing low- shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. ArXivabs/2312.03015(2023)

work page arXiv 2023

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. ArXiv abs/2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

In: Conference on Robot Learning (2022)

Agia, C., Jatavallabhula, K.M., Khodeir, M.N.M., Mikvs’ik, O., Vineet, V., Mukadam, M., Paull, L., Shkurti, F.: Taskography: Evaluating robot task plan- ning over large 3d scene graphs. In: Conference on Robot Learning (2022)

2022

[3] [3]

Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: Int. Conf. Comput. Vis. pp. 5664–5673 (2019)

2019

[4] [4]

In: Conference on Robot Learning

Chang, H., Boyalakuntla, K., Lu, S., Cai, S., Jing, E., Keskar, S., Geng, S., Ab- bas, A., Zhou, L., Bekris, K., et al.: Context-aware entity grounding with open- vocabulary 3d scene graphs. In: Conference on Robot Learning. pp. 1950–1974 (2023)

1950

[5] [5]

In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS)

Chang, Y., Hughes, N., Ray, A., Carlone, L.: Hydra-multi: Collaborative online construction of 3d scene graphs with multi-robot teams. In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). pp. 10995–11002. IEEE (2023) 16 Y. Kim and UH. Kim

2023

[6] [6]

Chen, H., Yang, J., Vondrick, C., Mao, C.: Invite: Interpret and control vision- language models with text explanations. In: Int. Conf. Learn. Represent. (2024)

2024

[7] [7]

In: Scandinavian conference on image analysis

Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Coloring channel represen- tations for visual tracking. In: Scandinavian conference on image analysis. pp. 117–129. Springer (2015)

2015

[8] [8]

In: IEEE Conf

Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14531–14542 (2024)

2024

[9] [9]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. ArXivabs/1810.04805(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[10] [10]

In: IEEE Conf

Everts, I., Van Gemert, J.C., Gevers, T.: Evaluation of color stips for human action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 2850–2857 (2013)

2013

[11] [11]

In: IEEE Conf

Feng, M., Hou, H., Zhang, L., Wu, Z., Guo, Y., Mian, A.: 3d spatial multimodal knowledge accumulation for scene graph prediction in point cloud. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9182–9191 (2023)

2023

[12] [12]

In: IEEE Conf

Gadre, S.Y., Ehsani, K., Song, S., Mottaghi, R.: Continuous scene representations for embodied ai. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14849–14859 (2022)

2022

[13] [13]

In: IEEE Conf

Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: Gapartnet: Cross-category domain-generalizable object perception and manipulation via gen- eralizable and actionable parts. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7081–7091 (2023)

2023

[14] [14]

In: IEEE International Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028 (2024)

2024

[15] [15]

In: IEEE Conf

Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5356–5364 (2019)

2019

[16] [16]

He, J., Yang, S., Yang, S., Kortylewski, A., Yuan, X., Chen, J.N., Liu, S., Yang, C., Yu, Q., Yuille, A.: Partimagenet: A large, high-quality dataset of parts. In: Eur. Conf. Comput. Vis. pp. 128–145. Springer (2022)

2022

[17] [17]

Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: Fross: Faster-than-real- time online 3d semantic scene graph generation from rgb-d images. In: Int. Conf. Comput. Vis. pp. 28818–28827 (2025)

2025

[18] [18]

In: Robotics: Science and Systems XVIII (2022)

Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception sys- tem for 3d scene graph construction and optimization. In: Robotics: Science and Systems XVIII (2022)

2022

[19] [19]

Kim, U.H.,Park, J.M.,Song,T.J.,Kim, J.H.:3-dscenegraph: Asparseandseman- ticrepresentationofphysicalenvironmentsforintelligentagents.IEEEtransactions on cybernetics50(12), 4921–4933 (2019)

2019

[20] [20]

In: IEEE Conf

Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14183– 14193 (2024)

2024

[21] [21]

In: IEEE Conf

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10965–10975 (2022)

2022

[22] [22]

Li, X., Xu, S., Yang, Y., Cheng, G., Tong, Y., Tao, D.: Panoptic-partformer: Learn- ing a unified model for panoptic part segmentation. In: Eur. Conf. Comput. Vis. pp. 729–747. Springer (2022) OP3DSG 17

2022

[23] [23]

Intelligent Service Robotics15(4), 459–473 (2022)

Li, Y., Ma, Y., Huo, X., Wu, X.: Remote object navigation for service robots using hierarchical knowledge graph in human-centered environments. Intelligent Service Robotics15(4), 459–473 (2022)

2022

[24] [24]

IEEE Trans

Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process.24(12), 5630–5644 (2015)

2015

[25] [25]

Lin, C., Sun, P., Jiang, Y., Luo, P., Qu, L., Haffari, G., Yuan, Z., Cai, J.: Learning object-language alignments for open-vocabulary object detection. In: Int. Conf. Learn. Represent. (2023)

2023

[26] [26]

In: IEEE Conf

Liu, M., Zhu, Y., Cai, H., Han, S., Ling, Z., Porikli, F.M., Su, H.: Partslip: Low- shot part segmentation for 3d point clouds via pretrained image-language models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21736–21746 (2023)

2023

[27] [27]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Eur. Conf. Comput. Vis. pp. 38–55. Springer (2024)

2024

[28] [28]

Liu, Y., Zhang, X., Zhang, S., He, X.: Part-aware prototype network for few-shot semantic segmentation. In: Eur. Conf. Comput. Vis. (2020)

2020

[29] [29]

Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3d. In: Int. Conf. Comput. Vis. pp. 7818–7827 (2025)

2025

[30] [30]

ArXiv abs/2004.07944(2020)

Meletis, P., Wen, X., Lu, C., De Geus, D., Dubbelman, G.: Cityscapes- panoptic-parts and pascal-panoptic-parts datasets for scene understanding. ArXiv abs/2004.07944(2020)

work page arXiv 2004

[31] [31]

Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovit- skiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open- vocabulary object detection. In: Eur. Conf. Comput. Vis. pp. 728–755. Springer (2022)

2022

[32] [32]

In: IEEE Conf

Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Part- net: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2019)

2019

[33] [33]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Int. Conf. Mach. Learn. pp. 8748–8763 (2021)

2021

[34] [34]

In: IEEE Conf

Ramanathan, V., Kalia, A., Petrovic, V., Wen, Y., Zheng, B., Guo, B., Wang, R., Marquez, A., Kovvuri, R., Kadian, A., et al.: Paco: Parts and attributes of common objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7141–7151 (2023)

2023

[35] [35]

In: Conference on Robot Learning (2023)

Rana, K., Haviland, J., Garg, S., Abou-Chakra, J., Reid, I.D., Sünderhauf, N.: Sayplan: Grounding large language models using 3d scene graphs for scalable task planning. In: Conference on Robot Learning (2023)

2023

[36] [36]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. ArXivabs/1908.10084(2019)

work page internal anchor Pith review Pith/arXiv arXiv 1908

[37] [37]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. ArXivabs/2401.14159(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

The International Journal of Robotics Research40, 1510 – 1546 (2021)

Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40, 1510 – 1546 (2021)

2021

[39] [39]

In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS)

Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction. In: IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). pp. 4083–4090. IEEE (2025) 18 Y. Kim and UH. Kim

2025

[40] [40]

OpenAI GPT-5 System Card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. ArXiv abs/2601.03267(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Sun, P., Chen, S., Zhu, C., Xiao, F., Luo, P., Xie, S., Yan, Z.: Going denser with open-vocabulary part segmentation. In: Int. Conf. Comput. Vis. pp. 15453–15465 (2023)

2023

[42] [42]

International journal of computer vi- sion7(1), 11–32 (1991)

Swain, M.J., Ballard, D.H.: Color indexing. International journal of computer vi- sion7(1), 11–32 (1991)

1991

[43] [43]

Szymańska, E., Dusmanu, M., Buurlage, J.W., Rad, M., Pollefeys, M.: Space3d- bench: Spatial 3d question answering benchmark. In: Eur. Conf. Comput. Vis. pp. 68–85. Springer (2024)

2024

[44] [44]

IEEE Trans

Van De Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell.32(9), 1582–1596 (2009)

2009

[45] [45]

IEEE Trans

Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Trans. Image Process.18(7), 1512–1523 (2009)

2009

[46] [46]

In: IEEE Conf

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3961–3970 (2020)

2020

[47] [47]

IEEE Trans

Wang, J., Chakraborty, R., Yu, S.X.: Transformer for 3d point clouds. IEEE Trans. Pattern Anal. Mach. Intell.44, 4419–4431 (2019)

2019

[48] [48]

Wang, L., Li, X., Fang, Y.: Few-shot learning of part-specific probability space for 3dshapesegmentation.In:IEEEConf.Comput.Vis.PatternRecog.pp.4504–4513 (2020)

2020

[49] [49]

Wang, P., Shen, X., Lin, Z.L., Cohen, S.D., Price, B.L., Yuille, A.L.: Joint object and part segmentation using deep learned potentials. In: Int. Conf. Comput. Vis. pp. 1573–1581 (2015)

2015

[50] [50]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Adv. Neural Inform. Process. Syst. vol. 35, pp. 24824–24837 (2022)

2022

[51] [51]

Wei, M., Yue, X., Zhang, W., Kong, S., Liu, X., Pang, J.: Ov-parts: Towards open- vocabulary part segmentation. In: Adv. Neural Inform. Process. Syst. vol. 36, pp. 70094–70114 (2023)

2023

[52] [52]

In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024)

Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024)

2024

[53] [53]

In: IEEE International Conference on Robotics and Au- tomation (ICRA) (2026)

Werby, A., Rotondi, D., Scaparro, F., Arras, K.O.: Keysg: Hierarchical keyframe- based 3d scene graphs. In: IEEE International Conference on Robotics and Au- tomation (ICRA) (2026)

2026

[54] [54]

In: IEEE Conf

Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5064–5074 (2023)

2023

[55] [55]

In: IEEE Conf

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: Incre- mental 3d scene graph prediction from rgb-d sequences. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7515–7525 (2021)

2021

[56] [56]

In: IEEE Conf

Wu, S., Zhang, W., Jin, S., Liu, W., Loy, C.C.: Aligning bag of regions for open- vocabulary object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 15254–15264 (2023)

2023

[57] [57]

Sampart3d: Segment any part in 3d objects.arXiv preprint arXiv:2411.07184,

Yang, Y.N., Huang, Y., Guo, Y.C., Lu, L., Wu, X., Lam, E.Y., Cao, Y.P., Liu, X.: Sampart3d: Segment any part in 3d objects. ArXivabs/2411.07184(2024) OP3DSG 19

work page arXiv 2024

[58] [58]

In: IEEE Conf

Yu, F., Liu, K., Zhang, Y., Zhu, C., Xu, K.: Partnet: A recursive part decomposi- tion network for fine-grained and hierarchical shape segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9483–9492 (2019)

2019

[59] [59]

In: IEEE Conf

Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 19401–19413 (2025)

2025

[60] [60]

In: IEEE Conf

Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1724–1732 (2024)

2024

[61] [61]

In: IEEE Conf

Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: Regionclip: Region-based language-image pretraining. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 16793–16803 (2022)

2022

[62] [62]

International journal of computer vision127(3), 302–321 (2019)

Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Se- mantic understanding of scenes through the ade20k dataset. International journal of computer vision127(3), 302–321 (2019)

2019

[63] [63]

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty- thousand classes using image-level supervision. In: Eur. Conf. Comput. Vis. pp. 350–368. Springer (2022)

2022

[64] [64]

ArXivabs/2312.03015(2023)

Zhou, Y., Gu, J., Li, X., Liu, M., Fang, Y., Su, H.: Partslip++: Enhancing low- shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. ArXivabs/2312.03015(2023)

work page arXiv 2023