arxiv: 2605.15753 · v1 · pith:ZTLW4NQSnew · submitted 2026-05-15 · 💻 cs.RO · cs.CV

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

Pith reviewed 2026-05-20 18:53 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords: functional scene graphs · 3D scene understanding · open-vocabulary · indoor environments · hierarchical graphs · robotic manipulation · temporal graph optimization · visual grounding

The pith

An open-vocabulary pipeline using 2D grounding and 3D temporal optimization can construct hierarchical functional 3D scene graphs for dense indoor scenes with small objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that functional 3D scene graphs can be extended to cover dense tabletop objects and multi-level functional relationships in indoor spaces. A sympathetic reader would care because such representations could help robots understand and interact with cluttered real-world environments, not just large furniture. The work extends benchmark coverage and addresses instance confusion and attribution uncertainty by anchoring edges in 2D visual evidence and optimizing the graph over time in 3D. If the method works, it would allow more complete, hierarchical scene understanding for practical robotic applications.

Core claim

The paper claims that functional 3D scene graphs can be reliably inferred in challenging real-world scenes with small-scale, dense objects by combining four steps: (1) anchoring fine-grained functional edges in 2D visual evidence, (2) associating nodes across frames with multiple cues, (3) formulating edge association as a temporal graph optimization that integrates evidence accumulation, entropy regularization, and temporal smoothing, and (4) performing global hierarchy shaping.

What carries the argument

Temporal graph optimization that combines evidence accumulation, entropy regularization, and temporal smoothing to robustly determine functional connections of each node.
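As a rough illustration of how these three ingredients could fit together for a single node, the sketch below keeps a running sum of per-frame scores (evidence accumulation), turns it into a distribution with a softmax (the closed-form maximizer of expected score plus a temperature-weighted entropy term), and blends it with the previous belief (temporal smoothing). This is not the paper's formulation, which is not given here: the function names, the `temperature` and `smoothing` parameters, the sign and form of the entropy term, and the toy scores are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def associate_edge(frame_evidence, temperature=1.0, smoothing=0.5):
    """Pick one functional partner for a node from per-frame candidate scores.

    frame_evidence: list of frames, each a list with one score per candidate.
    Returns (index of the chosen candidate, final belief over candidates).
    """
    num_candidates = len(frame_evidence[0])
    accumulated = [0.0] * num_candidates                 # evidence accumulation
    belief = [1.0 / num_candidates] * num_candidates     # uniform prior
    for evidence in frame_evidence:
        accumulated = [a + e for a, e in zip(accumulated, evidence)]
        # Entropy-regularized choice: softmax maximizes <p, s> + temperature * H(p).
        target = softmax(accumulated, temperature)
        # Temporal smoothing: move only part of the way toward the new target.
        belief = [smoothing * b + (1.0 - smoothing) * t
                  for b, t in zip(belief, target)]
    best = max(range(num_candidates), key=belief.__getitem__)
    return best, belief


# Hypothetical scores for one handle and two candidate drawers over four frames.
# Frame 2 favours the wrong drawer when viewed in isolation.
frames = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.1], [0.8, 0.2]]
best, belief = associate_edge(frames)
print(best, [round(b, 3) for b in belief])
```

In this toy run the single misleading frame does not flip the decision, because the accumulated score still favours the first candidate and the smoothed belief moves gradually.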

If this is right

  • The approach handles small-scale, dense, and similar instances that lack visual anchoring.
  • Multiple cues and temporal optimization resolve instance confusion and attribution uncertainty across frames.
  • Global hierarchy shaping recovers the multi-level graph structure.
  • This unlocks potential for practical applications in robotic manipulation.
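The multi-level structure mentioned in the list above can be pictured with a small sketch: given candidate child-to-parent edges with confidences, keep the strongest parent for each child and reject any edge that would close a cycle, so the result is a forest. The source does not describe how hierarchy shaping is actually carried out, so the greedy rule, the function names, and the example edges below are all hypothetical.

```python
def shape_hierarchy(edges):
    """Greedy forest construction from (child, parent, confidence) triples."""
    parent_of = {}

    def reaches(node, goal):
        # Walk upward through already accepted parent links.
        while node is not None:
            if node == goal:
                return True
            node = parent_of.get(node)
        return False

    for child, parent, _conf in sorted(edges, key=lambda e: -e[2]):
        if child in parent_of:
            continue  # a higher-confidence parent was already chosen
        if reaches(parent, child):
            continue  # accepting this edge would create a cycle
        parent_of[child] = parent
    return parent_of


def depth(node, parent_of):
    """Number of parent links between a node and the root of its tree."""
    steps = 0
    while node in parent_of:
        node = parent_of[node]
        steps += 1
    return steps


# Hypothetical candidate edges, including a weak duplicate and a cyclic one.
candidates = [
    ("handle", "drawer", 0.9),
    ("drawer", "cabinet", 0.8),
    ("handle", "cabinet", 0.4),
    ("cabinet", "handle", 0.3),
]
hierarchy = shape_hierarchy(candidates)
print(hierarchy, depth("handle", hierarchy))
```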

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar optimization techniques might improve other graph-based scene representations in dynamic environments.
  • Combining this with language models could enhance open-vocabulary capabilities further for zero-shot scenarios.
  • Future work could test scalability to larger or outdoor scenes.

Load-bearing premise

That 2D visual grounding supplies accurate and unambiguous evidence for fine-grained functional edges and that multiple cues with temporal optimization can resolve confusions without additional adjustments.

What would settle it

A scene with many similar small tabletop objects filmed from moving viewpoints where the predicted functional edges or hierarchies do not match detailed human ground truth annotations.

Figures

Figures reproduced from arXiv: 2605.15753 by Alexandros Delitzas, Chenyangguang Zhang, Francis Engelmann, Marc Pollefeys, Xiangkui Zhang, Xiangyang Ji, Xinggang Hu.

Figure 1: Hierarchical and holistic functional 3D scene graphs. In contrast to prior approaches [72], we model tabletop manipulable objects and explicit hierarchical object–part structures in functional 3D scene graphs. Recently, the pioneering work OpenFunGraph [72] introduces the concept of functional 3D scene graphs by extending traditional scene graphs to include objects, interactive elements, and functional r…
Figure 2: Overview of the hierarchical functional 3D scene graph construction pipeline. … tiny parts suffer from unstable localization under reconstruction noise, and dense hierarchical nodes (e.g., adjacent drawers) exhibit severe bounding box aliasing. Consequently, 3D spatial proximity becomes non-discriminative. To circumvent this issue, we abandon the paradigm of relying on distorted 3D coordinates for inference …
Figure 3: Examples of the improved benchmark. We introduce tabletop manipulable objects and hierarchical relationships. … enable the above extensions, we conducted a systematic hierarchical functional annotation and statistical analysis on FunGraph3D [72] and SceneFun3D [11]. FunGraph3D contains 722 nodes (O = 224, C = 94, U = 404), with 592 functional edges in total; among them, 118 are hierarchical relations, and 1…
Figure 4: Qualitative results. Our constructed functional 3D scene graph features a hierarchical structure and covers a more comprehensive range of manipulable objects. … hierarchical structures, and successfully construct most nodes and functional edges in tabletop scenarios. See the supplement for more qualitative results. Real-World Manipulation Tasks. Furthermore, to validate the utility of the constructed hierar…
Original abstract

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack of hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, with lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.
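To make the abstract's step of associating nodes across frames "using multiple cues" more concrete, here is a minimal sketch that mixes a geometric cue (distance between 3D centers) with a semantic cue (cosine similarity of feature vectors). The abstract does not say which cues the paper uses or how they are weighted, so the cue choice, the weights, the threshold, and the dictionary layout are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_score(detection, node, dist_scale=0.5, w_geo=0.5, w_sem=0.5):
    """Weighted mix of a proximity cue and a feature-similarity cue."""
    distance = math.dist(detection["center"], node["center"])
    geometric = math.exp(-distance / dist_scale)   # 1.0 when centers coincide
    semantic = cosine(detection["feature"], node["feature"])
    return w_geo * geometric + w_sem * semantic


def associate(detection, nodes, threshold=0.6):
    """Return the index of the best-matching node, or None to start a new node."""
    if not nodes:
        return None
    scores = [match_score(detection, n) for n in nodes]
    best = max(range(len(nodes)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# Two hypothetical adjacent drawers: nearly the same distance from the
# detection, so only the semantic cue tells them apart.
nodes = [
    {"center": [0.0, 0.0, 0.0], "feature": [1.0, 0.0]},
    {"center": [0.0, 0.0, 0.2], "feature": [0.0, 1.0]},
]
near = {"center": [0.0, 0.0, 0.1], "feature": [0.1, 0.9]}
far = {"center": [5.0, 5.0, 5.0], "feature": [1.0, 1.0]}
print(associate(near, nodes), associate(far, nodes))
```

The example mirrors the situation described in the Figure 2 excerpt, where 3D proximity alone cannot separate adjacent parts and a second cue has to break the tie.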

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces an open-vocabulary pipeline for generating hierarchical functional 3D scene graphs in indoor spaces, extending benchmarks to dense tabletop objects and multi-level relationships. It addresses challenges of small-scale instances, instance confusion, and attribution uncertainty by anchoring functional edges via 2D visual grounding, using multi-cue 3D association across frames, formulating edge association as temporal graph optimization with evidence accumulation, entropy regularization, and temporal smoothing, and applying global hierarchy shaping. The authors assert that this enables reliable inference of functional graphs in real-world scenes for applications like robotic manipulation.

Significance. If the claims hold, the work would significantly advance functional scene graph construction by handling fine-grained details in cluttered indoor environments, which prior work neglected. The combination of 2D grounding with 3D temporal optimization offers a practical approach to open-vocabulary functional reasoning. Strengths include the explicit handling of new challenges like dense objects and the use of regularization techniques for robustness. This could unlock better performance in downstream robotic tasks.

major comments (3)
  1. [Abstract] The abstract claims 'extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes' but provides no quantitative metrics, error bars, ablation studies, or specific validation results. This absence leaves the central reliability claim unsupported and requires detailed experimental evidence in the main text.
  2. [§3.2] The pipeline anchors fine-grained functional edges directly from 2D visual evidence before temporal graph optimization (§3.2). For the dense tabletop benchmark, this relies on 2D grounding producing unambiguous instance-to-function mappings despite occlusion and visual similarity. No error analysis or sensitivity study on grounding accuracy is mentioned, which is load-bearing since optimization steps may not correct systematic attribution errors.
  3. [Experiments] The temporal graph optimization integrates evidence accumulation, entropy regularization, and smoothing. Without ablations isolating the contribution of each component or comparisons to simpler baselines, it is difficult to evaluate if this formulation is necessary or superior for resolving cross-frame uncertainties.
minor comments (2)
  1. [Abstract] Consider adding a sentence on the scale of the new benchmark or number of scenes evaluated to give readers a sense of the experimental scope.
  2. [Notation] Ensure that terms like 'functional relationship edges' and 'multi-level functional relationships' are defined with clear notation upon first use to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where appropriate, we have revised the manuscript to incorporate additional evidence, analyses, and clarifications that strengthen the presentation of our contributions.

  1. Referee: [Abstract] The abstract claims 'extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes' but provides no quantitative metrics, error bars, ablation studies, or specific validation results. This absence leaves the central reliability claim unsupported and requires detailed experimental evidence in the main text.

    Authors: We agree that the abstract would benefit from explicit reference to key quantitative results to better ground the reliability claim. In the revised version, we have updated the abstract to include concise mentions of our main quantitative findings (e.g., functional edge precision/recall on the dense tabletop benchmark and improvements over baselines), while preserving brevity. The main text already contains the full experimental results, error bars, and ablations in Section 4; the abstract revision now explicitly points readers to these sections. revision: yes

  2. Referee: [§3.2] The pipeline anchors fine-grained functional edges directly from 2D visual evidence before temporal graph optimization (§3.2). For the dense tabletop benchmark, this relies on 2D grounding producing unambiguous instance-to-function mappings despite occlusion and visual similarity. No error analysis or sensitivity study on grounding accuracy is mentioned, which is load-bearing since optimization steps may not correct systematic attribution errors.

    Authors: This observation correctly identifies a load-bearing assumption. While the subsequent temporal optimization and multi-cue association are intended to mitigate isolated grounding failures, we acknowledge that a dedicated analysis of grounding accuracy is valuable. In the revision, we have added a short error analysis paragraph in §3.2 that discusses observed grounding failure modes on the dense tabletop data and a sensitivity study in the Experiments section that quantifies how controlled degradation in 2D grounding accuracy propagates to final graph metrics. These additions make the dependence on the 2D stage explicit and demonstrate the robustness margin provided by the 3D optimization. revision: yes

  3. Referee: [Experiments] The temporal graph optimization integrates evidence accumulation, entropy regularization, and smoothing. Without ablations isolating the contribution of each component or comparisons to simpler baselines, it is difficult to evaluate if this formulation is necessary or superior for resolving cross-frame uncertainties.

    Authors: We concur that component-wise ablations and baseline comparisons are necessary to justify the optimization design. The revised manuscript now includes a dedicated ablation table in the Experiments section that isolates evidence accumulation, entropy regularization, and temporal smoothing. We also report results against simpler baselines (per-frame majority voting and non-regularized temporal averaging). The new results show that each term contributes measurably to cross-frame consistency and that the full formulation outperforms the ablated variants on the dense scenes, thereby substantiating the necessity of the integrated approach. revision: yes
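The two simpler baselines named in the third response, per-frame majority voting and non-regularized temporal averaging, can be sketched in a few lines for a single node with a fixed set of candidates. The function names and the toy scores are hypothetical; the sketch only shows how the two rules can disagree and makes no claim about the results reported in the paper.

```python
from collections import Counter


def majority_vote(frame_scores):
    """Each frame votes for its top-scoring candidate; the most common vote wins."""
    votes = Counter(
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    )
    return votes.most_common(1)[0][0]


def temporal_average(frame_scores):
    """Average the raw scores over frames, then take the top candidate."""
    num_candidates = len(frame_scores[0])
    means = [
        sum(scores[k] for scores in frame_scores) / len(frame_scores)
        for k in range(num_candidates)
    ]
    return max(range(num_candidates), key=means.__getitem__)


# Hypothetical scores: two weakly confident frames and one strongly confident frame.
scores = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
print(majority_vote(scores), temporal_average(scores))
```

Voting ignores how confident each frame is, while averaging lets one strong frame outweigh two weak ones, which is one reason an ablation against both is informative.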

Circularity Check

0 steps flagged

No significant circularity; pipeline uses external grounding and standard optimization

The paper presents an engineering pipeline that first anchors fine-grained functional edges directly from 2D visual evidence, then associates nodes across frames using multiple 3D cues, formulates edge association as temporal graph optimization incorporating evidence accumulation, entropy regularization and smoothing, and finally applies global hierarchy shaping. None of these steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the method description relies on external 2D grounding tools and conventional graph techniques whose correctness is independent of the target functional-graph outputs. The central claim of reliable inference on the new dense tabletop benchmark is evaluated via experiments rather than being presupposed by the derivation itself, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim depends on the effectiveness of 2D visual evidence for functional edges and the ability of multi-cue 3D association plus optimization to handle small-object confusion; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: 2D visual grounding supplies reliable evidence for anchoring fine-grained functional relationships in 3D
    Pipeline explicitly anchors edges from 2D visual evidence as the starting point for relational reasoning.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 1 internal anchor

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: ICCV (2019)
  3. [3] Atzmon, M., Maron, H., Lipman, Y.: Point convolutional neural networks by extension operators. ACM TOG (2018)
  4. [4] Banerjee, P., Shkodrani, S., Moulon, P., Hampali, S., Zhang, F., Fountain, J., Miller, E., Basol, S., Newcombe, R., Wang, R., et al.: Introducing hot3d: An egocentric dataset for 3d hand and object tracking. arXiv preprint arXiv:2406.09598 (2024)
  5. [5] Bieri, V., Zamboni, M., Blumer, N.S., Chen, Q., Engelmann, F.: OpenCity3D: 3D Urban Scene Understanding with Vision-Language Models (2025)
  6. [6] Chen, L., Wang, X., Lu, J., Lin, S., Wang, C., He, G.: Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning. In: CVPR (2024)
  7. [7] Chen, Z., Hasson, Y., Schmid, C., Laptev, I.: Alignsdf: Pose-aligned signed distance fields for hand-object reconstruction. In: ECCV (2022)
  8. [8] Cho, W., Lee, J., Yi, M., Kim, M., Woo, T., Kim, D., Ha, T., Lee, H., Ryu, J.H., Woo, W., et al.: Dense hand-object (ho) graspnet with full grasping taxonomy and dynamics. ECCV (2024)
  9. [9] Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: CVPR (2019)
  10. [10] Delitzas, A., Parelli, M., Hars, N., Vlassis, G., Anagnostidis, S.K., Bachmann, G., Hofmann, T.: Multi-clip: Contrastive vision-language pre-training for question answering tasks in 3d scenes. In: BMVC (2023)
  11. [11] Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes. In: CVPR (2024)
  12. [12] Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning approach for object affordance detection (2018)
  13. [13] Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M.: 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In: CVPR (2020)
  14. [14] Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., Tombari, F.: Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. ICLR (2024)
  15. [15] Fan, Z., Parelli, M., Kadoglou, M.E., Chen, X., Kocabas, M., Black, M.J., Hilliges, O.: Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: CVPR (2024)
  16. [16] Fang, K., Wu, T.L., Yang, D., Savarese, S., Lim, J.J.: Demo2vec: Reasoning object affordances from online videos. In: CVPR (2018)
  17. [17] Fu, S., Yang, Q., Mo, Q., Yan, J., Wei, X., Meng, J., Xie, X., Zheng, W.S.: Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. In: CVPR (2025)
  18. [18] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning (2024)
  19. [19] Han, L., Zheng, T., Xu, L., Fang, L.: Occuseg: Occupancy-aware 3d instance segmentation. In: CVPR (2020)
  20. [20] Hou, J., Dai, A., Nießner, M.: 3d-sis: 3d semantic instance segmentation of rgb-d scans. In: CVPR (2019)
  21. [21] Hsu, J., Mao, J., Wu, J.: Ns3d: Neuro-symbolic grounding of 3d objects and relations. In: CVPR (2023)
  22. [22] Hu, X., Wu, Y., Zhao, M., Cao, Z., Zhang, X., Ji, X.: Dyo-slam: Visual localization and object mapping in dynamic scenes. IEEE Transactions on Circuits and Systems for Video Technology (2025)
  23. [23] Hu, Z., Bai, X., Shang, J., Zhang, R., Dong, J., Wang, X., Sun, G., Fu, H., Tai, C.L.: Vmnet: Voxel-mesh network for geodesic-aware 3d semantic segmentation. In: ICCV (2021)
  24. [24] Hua, B.S., Tran, M.K., Yeung, S.K.: Pointwise convolutional neural networks. In: CVPR (2018)
  25. [25] Huang, R., Peng, S., Takmaz, A., Tombari, F., Pollefeys, M., Song, S., Huang, G., Engelmann, F.: Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels. ECCV (2024)
  26. [26] Huang, S., Chen, Y., Jia, J., Wang, L.: Multi-view transformer for 3d visual grounding. In: CVPR (2022)
  27. [27] Huang, X., Huang, Y.J., Zhang, Y., Tian, W., Feng, R., Zhang, Y., Xie, Y., Li, Y., Zhang, L.: Open-set image tagging with multi-grained text supervision. In: ACM MM (2025)
  28. [28] Jatavallabhula, K.M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., et al.: Conceptfusion: Open-set multimodal 3d mapping. ICRA2023 Workshop on Pretraining for Robotics (PT4R) (2023)
  29. [29] Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: Pointgroup: Dual-set point grouping for 3d instance segmentation. In: CVPR (2020)
  30. [30] Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields. In: ICCV (2023)
  31. [31] Koch, S., Hermosilla, P., Vaskevicius, N., Colosi, M., Ropinski, T.: Lang3dsg: Language-based contrastive pre-training for 3d scene graph prediction (2024)
  32. [32] Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: CVPR (2024)
  33. [33] Landrieu, L., Simonovsky, M.: Large-scale point cloud semantic segmentation with superpoint graphs. In: CVPR (2018)
  34. [34] Li, Q., Mo, K., Yang, Y., Zhao, H., Guibas, L.: IFR-Explore: Learning inter-object functional relationships in 3d indoor scenes. ICLR (2022)
  35. [35] Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on x-transformed points. NeurIPS (2018)
  36. [36] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)
  37. [37] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (2024)
  38. [38] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
  39. [39] Nagarajan, T., Feichtenhofer, C., Grauman, K.: Grounded human-object interaction hotspots from video. In: ICCV (2019)
  40. [40] Ok, K., Liu, K., Frey, K., How, J.P., Roy, N.: Robust object-based slam for high-speed autonomous navigation. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 669–675. IEEE (2019)
  41. [41] OpenAI: Gpt-5 system card. Technical report, OpenAI (Aug 2025), https://cdn.openai.com/gpt-5-system-card.pdf, accessed: 2025-11-14
  42. [42] Parelli, M., Delitzas, A., Hars, N., Vlassis, G., Anagnostidis, S., Bachmann, G., Hofmann, T.: CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes. In: CVPRW (2023)
  43. [43] Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: CVPR (2023)
  44. [44] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017)
  45. [45] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS (2017)
  46. [46] Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. In: CVPR (2024)
  47. [47] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024)
  48. [48] Roh, J., Desingh, K., Farhadi, A., Fox, D.: Languagerefer: Spatial-language model for 3d visual grounding (2022)
  49. [49] Rosinol, A., Gupta, A., Abate, M., Shi, J., Carlone, L.: 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. Robotics, Science and Systems (2020)
  50. [50] Rosinol, A., Violette, A., Abate, M., Hughes, N., Chang, Y., Shi, J., Gupta, A., Carlone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs (2021)
  51. [51] Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction (2025)
  52. [52] Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3d: Mask transformer for 3d semantic instance segmentation (2023)
  53. [53] Takmaz, A., Delitzas, A., Sumner, R.W., Engelmann, F., Wald, J., Tombari, F.: Search3D: Hierarchical Open-Vocabulary 3D Segmentation (2025)
  54. [54] Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. NeurIPS (2023)
  55. [55] Takmaz, A., Schult, J., Kaftan, I., Akçay, M., Leibe, B., Sumner, R., Engelmann, F., Tang, S.: 3D Segmentation of Humans in Point Clouds with Synthetic Data. In: ICCV (2023)
  56. [56] Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.: Kpconv: Flexible and deformable convolution for point clouds. In: ICCV (2019)
  57. [57] Vu, T., Kim, K., Luu, T.M., Nguyen, T., Yoo, C.D.: Softgroup for 3d instance segmentation on point clouds. In: CVPR (2022)
  58. [58] Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: CVPR (2020)
  59. [59] Wang, Z., Cheng, B., Zhao, L., Xu, D., Tang, Y., Sheng, L.: Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud. In: CVPR (2023)
  60. [60] Weder, S., Blum, H., Engelmann, F., Pollefeys, M.: Labelmaker: Automatic semantic label generation from rgb-d trajectories (2024)
  61. [61] Weder, S., Engelmann, F., Schönberger, J.L., Seki, A., Pollefeys, M., Oswald, M.R.: Alster: A Local Spatio-temporal Expert for Online 3D Semantic Reconstruction (2023)
  62. [62] Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: CVPR (2023)
  63. [63] Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: SceneGraphFusion: Incremental 3d scene graph prediction from rgb-d sequences. In: CVPR (2021)
  64. [64] Yang, Z., Zhang, S., Wang, L., Luo, J.: Sat: 2d semantics assisted training for 3d visual grounding. In: ICCV (2021)
  65. [65] Ye, Y., Gupta, A., Kitani, K., Tulsiani, S.: G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis. In: CVPR (2024)
  66. [66] Ye, Y., Gupta, A., Tulsiani, S.: What’s in your hands? 3d reconstruction of generic objects in hands. In: CVPR (2022)
  67. [67] Yilmaz, G., Peng, S., Pollefeys, M., Engelmann, F., Blum, H.: OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation. arXiv preprint arXiv:2405.20141 (2024)
  68. [68] Yoshida, T., Kurita, S., Nishimura, T., Mori, S.: Text-driven affordance learning from egocentric vision. arXiv preprint arXiv:2404.02523 (2024)
  69. [69] Yue, Y., Mahadevan, S., Schult, J., Engelmann, F., Leibe, B., Schindler, K., Kontogianni, T.: Agile3d: Attention guided interactive multi-object 3d segmentation. ICLR (2024)
  70. [70] Zhai, W., Luo, H., Zhang, J., Cao, Y., Tao, D.: One-shot object affordance detection in the wild. IJCV (2022)
  71. [71] Zhang, C., Yu, J., Song, Y., Cai, W.: Exploiting edge-oriented reasoning for 3d point-based scene graph analysis. In: CVPR (2021)
  72. [72] Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: CVPR (2025)
  73. [73] Zhang, C., Di, Y., Zhang, R., Zhai, G., Manhardt, F., Tombari, F., Ji, X.: Ddf-ho: Hand-held object reconstruction via conditional directed distance field. NeurIPS (2023)
  74. [74] Zhang, C., Jiao, G., Di, Y., Wang, G., Huang, Z., Zhang, R., Manhardt, F., Fu, B., Tombari, F., Ji, X.: Moho: Learning single-view hand-held object reconstruction with multi-view occlusion-aware supervision. In: CVPR (2024)
  75. [75] Zhang, S., Hao, A., Qin, H., et al.: Knowledge-inspired 3d scene graph prediction in point cloud. NeurIPS (2021)
  76. [76] Zhang, Y., Gong, Z., Chang, A.X.: Multi3drefer: Grounding text description to multiple 3d objects. In: ICCV (2023)
  77. [77] Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. In: CVPR (2024)
  78. [78] Zhou, S., Chang, H., Jiang, S., Fan, Z., Zhu, Z., Xu, D., Chari, P., You, S., Wang, Z., Kadambi, A.: Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In: CVPR (2024)
  79. [79] Zuo, X., Samangouei, P., Zhou, Y., Di, Y., Li, M.: Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding. IJCV (2024)