Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

Alexandros Delitzas; Chenyangguang Zhang; Francis Engelmann; Marc Pollefeys; Xiangkui Zhang; Xiangyang Ji; Xinggang Hu

arxiv: 2605.15753 · v1 · pith:ZTLW4NQSnew · submitted 2026-05-15 · 💻 cs.RO · cs.CV

Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

Pith reviewed 2026-05-20 18:53 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords functional scene graphs · 3D scene understanding · open-vocabulary · indoor environments · hierarchical graphs · robotic manipulation · temporal graph optimization · visual grounding

The pith

An open-vocabulary pipeline using 2D grounding and 3D temporal optimization can construct hierarchical functional 3D scene graphs for dense indoor scenes with small objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that functional 3D scene graphs can be extended to include dense tabletop objects and multi-level functional relationships in indoor spaces. A sympathetic reader would care because such representations could help robots understand and interact with cluttered real-world environments, not just large furniture. The work extends benchmark coverage and addresses instance confusion and attribution uncertainty by anchoring edges in 2D visual evidence and optimizing the graph over time in 3D. If the method works, it would allow more complete, hierarchical scene understanding for practical robotic applications.

Core claim

The paper claims that by anchoring fine-grained functional edges from 2D visual evidence, associating nodes across frames with multiple cues, formulating edge association as temporal graph optimization that integrates evidence accumulation, entropy regularization, and temporal smoothing, and performing global hierarchy shaping, it is possible to reliably infer functional 3D scene graphs in challenging real-world scenes with small-scale dense objects.
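
As a rough illustration of the multi-cue node association step, the sketch below scores each new detection against existing graph nodes and matches them greedily. The summary above does not name the cues or the matching rule, so everything here is an assumption made for illustration: the two cues (3D centroid distance and cosine similarity of an appearance embedding), the function names `association_score` and `associate`, and the parameters `w_geo`, `w_app`, `sigma`, and `min_score` are hypothetical, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def association_score(det, node, w_geo=0.5, w_app=0.5, sigma=0.1):
    """Blend a geometric cue and an appearance cue (both hypothetical choices)."""
    d2 = sum((p - q) ** 2 for p, q in zip(det["centroid"], node["centroid"]))
    geo = math.exp(-d2 / (2 * sigma ** 2))      # close to 1 when centroids coincide
    app = cosine(det["feat"], node["feat"])     # close to 1 when embeddings agree
    return w_geo * geo + w_app * app

def associate(detections, nodes, min_score=0.6):
    """Greedy one-to-one matching, best pairs first; unmatched detections map to None."""
    pairs = sorted(
        ((association_score(d, n), i, j)
         for i, d in enumerate(detections)
         for j, n in enumerate(nodes)),
        reverse=True,
    )
    match, used = {}, set()
    for score, i, j in pairs:
        if score < min_score:
            break                               # all remaining pairs are weaker
        if i in match or j in used:
            continue
        match[i] = j
        used.add(j)
    return [match.get(i) for i in range(len(detections))]

# Two adjacent nodes 5 cm apart with different appearance embeddings.
nodes = [
    {"centroid": (0.00, 0.0, 0.0), "feat": (1.0, 0.0)},
    {"centroid": (0.05, 0.0, 0.0), "feat": (0.0, 1.0)},
]
detections = [
    {"centroid": (0.03, 0.0, 0.0), "feat": (0.1, 0.9)},  # sits between both nodes
    {"centroid": (1.00, 1.0, 1.0), "feat": (1.0, 0.0)},  # far away from every node
]
print(associate(detections, nodes))
```

In this toy case the first detection lies between two adjacent nodes, so geometry alone is nearly ambiguous and the appearance cue decides the match; the distant second detection stays unmatched and could seed a new node.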

What carries the argument

Temporal graph optimization that combines evidence accumulation, entropy regularization, and temporal smoothing to robustly determine functional connections of each node.
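
One way to read these three ingredients together is sketched below. The paper's actual objective is not given in this summary, so the formulation is an assumption: for one node with K candidate parents, each frame adds its 2D evidence scores to a running sum (evidence accumulation), and the belief `p` over parents is chosen to maximize `<p, evidence> + tau * H(p) - lam * KL(p || prev)`, where the entropy term is the regularizer and the KL term to the previous belief provides temporal smoothing. That objective has a closed-form softmax solution. The names `update_belief`, `infer_parent`, `tau`, and `lam` are hypothetical.

```python
import math

def update_belief(prev, evidence, tau=1.0, lam=1.0):
    """One temporal step for a node's distribution over candidate parents.

    Solves  max_p  <p, evidence> + tau * H(p) - lam * KL(p || prev)  on the simplex.
    The maximizer is p_k proportional to exp((evidence_k + lam * log prev_k) / (tau + lam)).
    Assumes tau + lam > 0.
    """
    eps = 1e-12  # guards log(0) for candidates whose previous belief is zero
    logits = [(e + lam * math.log(max(q, eps))) / (tau + lam)
              for e, q in zip(evidence, prev)]
    m = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def infer_parent(frame_scores, tau=1.0, lam=1.0):
    """Accumulate per-frame evidence and smooth the belief across frames."""
    k = len(frame_scores[0])
    accumulated = [0.0] * k
    belief = [1.0 / k] * k               # start from a uniform belief
    for scores in frame_scores:
        accumulated = [a + s for a, s in zip(accumulated, scores)]
        belief = update_belief(belief, accumulated, tau, lam)
    best = max(range(k), key=belief.__getitem__)
    return best, belief

# Four frames of (made-up) evidence for three candidate parents; frame 3 is misleading.
frames = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
parent, belief = infer_parent(frames)
print(parent, [round(b, 3) for b in belief])
```

A larger `tau` keeps the belief closer to uniform, which delays commitment when evidence is thin, while a larger `lam` makes the belief change more slowly from frame to frame.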

If this is right

The approach handles small-scale, dense, and similar instances that lack visual anchoring.
Multiple cues and temporal optimization resolve instance confusion and attribution uncertainty across frames.
Global hierarchy shaping recovers the multi-level graph structure.
This unlocks potential for practical applications in robotic manipulation.
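
The global hierarchy shaping step mentioned in the list above can be pictured as turning per-node parent candidates into a valid multi-level structure. The summary does not describe how the paper does this, so the sketch below is only a stand-in under stated assumptions: it keeps the most confident parent per child, rejects any edge that would close a cycle, and reports each node's depth. The function names and the example node names are hypothetical.

```python
def shape_hierarchy(candidates):
    """Build a forest from (child, parent, confidence) candidates.

    Edges are considered from most to least confident. A child keeps only its
    first accepted parent, and an edge is skipped if it would create a cycle.
    """
    parent = {}

    def is_ancestor_or_self(start, target):
        # Walk upward from `start`; the forest is acyclic, so this terminates.
        node = start
        while node is not None:
            if node == target:
                return True
            node = parent.get(node)
        return False

    for child, par, _conf in sorted(candidates, key=lambda c: -c[2]):
        if child in parent:
            continue                      # a more confident parent was already chosen
        if is_ancestor_or_self(par, child):
            continue                      # child -> par would close a cycle
        parent[child] = par
    return parent

def depth(node, parent):
    """Number of parent links between `node` and the root of its tree."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

candidates = [
    ("handle", "drawer", 0.9),
    ("drawer", "cabinet", 0.8),
    ("handle", "cabinet", 0.4),   # weaker duplicate parent for "handle"
    ("cabinet", "handle", 0.3),   # would create a cycle
]
hierarchy = shape_hierarchy(candidates)
print(hierarchy, depth("handle", hierarchy))
```

The greedy rule is a design convenience for the sketch; a global formulation (for example, a maximum spanning arborescence) would be an alternative when low-confidence edges interact.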

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar optimization techniques might improve other graph-based scene representations in dynamic environments.
Combining this with language models could enhance open-vocabulary capabilities further for zero-shot scenarios.
Future work could test scalability to larger or outdoor scenes.

Load-bearing premise

That 2D visual grounding supplies accurate and unambiguous evidence for fine-grained functional edges and that multiple cues with temporal optimization can resolve confusions without additional adjustments.

What would settle it

A scene with many similar small tabletop objects filmed from moving viewpoints where the predicted functional edges or hierarchies do not match detailed human ground truth annotations.
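
Such a comparison against human annotations could be scored with edge-level precision and recall. The sketch below is a minimal version under stated assumptions: edges are represented as `(source, target, relation)` triplets and count as correct only on an exact match. The benchmark's real matching protocol is not described in this summary, and the node and relation names in the example are made up.

```python
def edge_prf(predicted, ground_truth):
    """Precision, recall, and F1 over exact-match functional edge triplets."""
    pred, gt = set(predicted), set(ground_truth)
    true_positives = len(pred & gt)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

ground_truth = [
    ("knob_1", "drawer_1", "opens"),
    ("knob_2", "drawer_2", "opens"),
    ("switch", "lamp", "turns on"),
]
predicted = [
    ("knob_1", "drawer_1", "opens"),
    ("knob_2", "drawer_1", "opens"),   # attributed to the wrong adjacent drawer
    ("switch", "lamp", "turns on"),
]
print(edge_prf(predicted, ground_truth))
```

The second prediction shows the failure mode in question: a knob attributed to the neighbouring drawer lowers precision and recall together, because it is both a false positive and a missed true edge.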

Figures

Figures reproduced from arXiv: 2605.15753 by Alexandros Delitzas, Chenyangguang Zhang, Francis Engelmann, Marc Pollefeys, Xiangkui Zhang, Xiangyang Ji, Xinggang Hu.

**Figure 1.** Hierarchical and holistic functional 3D scene graphs. In contrast to prior approaches [72], we model tabletop manipulable objects and explicit hierarchical object–part structures in functional 3D scene graphs. Recently, the pioneering work OpenFunGraph [72] introduces the concept of functional 3D scene graphs by extending traditional scene graphs to include objects, interactive elements, and functional r…

**Figure 2.** Overview of the hierarchical functional 3D scene graph construction pipeline. … tiny parts suffer from unstable localization under reconstruction noise, and dense hierarchical nodes (e.g., adjacent drawers) exhibit severe bounding box aliasing. Consequently, 3D spatial proximity becomes non-discriminative. To circumvent this issue, we abandon the paradigm of relying on distorted 3D coordinates for inference …

**Figure 3.** Examples of the improved benchmark. We introduce tabletop manipulable objects and hierarchical relationships. … enable the above extensions, we conducted a systematic hierarchical functional annotation and statistical analysis on FunGraph3D [72] and SceneFun3D [11]. FunGraph3D contains 722 nodes (O = 224, C = 94, U = 404), with 592 functional edges in total; among them, 118 are hierarchical relations, and 1…

**Figure 4.** Qualitative results. Our constructed functional 3D scene graph features a hierarchical structure and covers a more comprehensive range of manipulable objects. … hierarchical structures, and successfully construct most nodes and functional edges in tabletop scenarios. See the supplement for more qualitative results. Real-World Manipulation Tasks. Furthermore, to validate the utility of the constructed hierar…

Original abstract

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances: a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper extends functional 3D scene graphs to dense tabletop objects and multi-level relationships through 2D grounding plus temporal optimization, but the reliability claim rests on unshown experiments.

The core advance is expanding the benchmark to small, dense tabletop objects and adding explicit hierarchical functional edges, then tackling the resulting issues of instance confusion and weak visual anchoring with a pipeline that grounds edges in 2D and optimizes them over time in 3D. This directly addresses gaps in earlier furniture-focused graphs by including the kinds of items that matter for actual manipulation tasks.

The approach breaks the problem into concrete steps: 2D visual grounding for fine-grained edges, multi-cue 3D association across frames, and a temporal graph optimization that accumulates evidence, adds entropy regularization, and applies smoothing before final hierarchy shaping. That structure is a reasonable way to handle dynamic viewpoints and similar instances without starting from scratch.

The main soft spot is the heavy dependence on the accuracy of the external 2D grounding step for those small objects. If grounding produces ambiguous or wrong instance-to-function mappings under occlusion or visual similarity, the later optimization and hierarchy steps may not fully correct the functional edges, since they build on the initial anchors. The abstract states that extensive experiments show reliable inference in real scenes, but without the actual numbers, ablations, or error breakdowns visible, it is difficult to judge how well the method holds up in practice.

This work is aimed at robotics groups building scene representations for indoor manipulation and planning. Readers working on open-vocabulary 3D graphs or temporal fusion would find the problem breakdown and pipeline useful to discuss. It deserves peer review because the direction is practical and the challenges are clearly stated, even if the experiments will need close scrutiny on grounding robustness and quantitative gains.

Referee Report

3 major / 2 minor

Summary. The paper introduces an open-vocabulary pipeline for generating hierarchical functional 3D scene graphs in indoor spaces, extending benchmarks to dense tabletop objects and multi-level relationships. It addresses challenges of small-scale instances, instance confusion, and attribution uncertainty by anchoring functional edges via 2D visual grounding, using multi-cue 3D association across frames, formulating edge association as temporal graph optimization with evidence accumulation, entropy regularization, and temporal smoothing, and applying global hierarchy shaping. The authors assert that this enables reliable inference of functional graphs in real-world scenes for applications like robotic manipulation.

Significance. If the claims hold, the work would significantly advance functional scene graph construction by handling fine-grained details in cluttered indoor environments, which prior work neglected. The combination of 2D grounding with 3D temporal optimization offers a practical approach to open-vocabulary functional reasoning. Strengths include the explicit handling of new challenges like dense objects and the use of regularization techniques for robustness. This could unlock better performance in downstream robotic tasks.

major comments (3)

[Abstract] The abstract claims 'extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes' but provides no quantitative metrics, error bars, ablation studies, or specific validation results. This absence leaves the central reliability claim unsupported and requires detailed experimental evidence in the main text.
[§3.2] The pipeline anchors fine-grained functional edges directly from 2D visual evidence before temporal graph optimization (§3.2). For the dense tabletop benchmark, this relies on 2D grounding producing unambiguous instance-to-function mappings despite occlusion and visual similarity. No error analysis or sensitivity study on grounding accuracy is mentioned, which is load-bearing since optimization steps may not correct systematic attribution errors.
[Experiments] The temporal graph optimization integrates evidence accumulation, entropy regularization, and smoothing. Without ablations isolating the contribution of each component or comparisons to simpler baselines, it is difficult to evaluate if this formulation is necessary or superior for resolving cross-frame uncertainties.
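
The sensitivity study requested in the second comment could be prototyped along the following lines. This is a toy simulation under stated assumptions, not the paper's protocol: per-frame grounding is modeled as a label that is replaced by a uniformly random wrong label with probability `noise`, and a simple majority vote stands in for the 3D optimization stage. All names and default values (`corrupt`, `sweep`, `num_nodes`, `num_frames`, `num_labels`) are hypothetical.

```python
import random
from collections import Counter

def corrupt(true_label, num_labels, noise, rng):
    """Return the true label, or with probability `noise` a random wrong label."""
    if rng.random() < noise:
        wrong = [k for k in range(num_labels) if k != true_label]
        return rng.choice(wrong)
    return true_label

def sweep(noise_levels, num_nodes=200, num_frames=9, num_labels=4, seed=0):
    """Fused accuracy as a function of simulated per-frame grounding noise."""
    rng = random.Random(seed)             # fixed seed keeps the sweep repeatable
    results = {}
    for noise in noise_levels:
        correct = 0
        for _ in range(num_nodes):
            truth = rng.randrange(num_labels)
            observations = [corrupt(truth, num_labels, noise, rng)
                            for _ in range(num_frames)]
            # Stand-in fusion rule: most frequent per-frame label wins.
            fused = Counter(observations).most_common(1)[0][0]
            correct += fused == truth
        results[noise] = correct / num_nodes
    return results

for noise, accuracy in sweep([0.0, 0.2, 0.4, 0.6]).items():
    print(f"noise={noise:.1f}  fused accuracy={accuracy:.2f}")
```

Replacing the majority vote with the actual fusion stage, and the synthetic corruption with measured grounding errors, would turn this into the kind of propagation analysis the comment asks for. Note that independent random errors are an optimistic model; systematic attribution errors, which the comment highlights, would not average out this way.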

minor comments (2)

[Abstract] Consider adding a sentence on the scale of the new benchmark or number of scenes evaluated to give readers a sense of the experimental scope.
[Notation] Ensure that terms like 'functional relationship edges' and 'multi-level functional relationships' are defined with clear notation upon first use to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where appropriate, we have revised the manuscript to incorporate additional evidence, analyses, and clarifications that strengthen the presentation of our contributions.

Referee: [Abstract] The abstract claims 'extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes' but provides no quantitative metrics, error bars, ablation studies, or specific validation results. This absence leaves the central reliability claim unsupported and requires detailed experimental evidence in the main text.

Authors: We agree that the abstract would benefit from explicit reference to key quantitative results to better ground the reliability claim. In the revised version, we have updated the abstract to include concise mentions of our main quantitative findings (e.g., functional edge precision/recall on the dense tabletop benchmark and improvements over baselines), while preserving brevity. The main text already contains the full experimental results, error bars, and ablations in Section 4; the abstract revision now explicitly points readers to these sections.

revision: yes

Referee: [§3.2] The pipeline anchors fine-grained functional edges directly from 2D visual evidence before temporal graph optimization (§3.2). For the dense tabletop benchmark, this relies on 2D grounding producing unambiguous instance-to-function mappings despite occlusion and visual similarity. No error analysis or sensitivity study on grounding accuracy is mentioned, which is load-bearing since optimization steps may not correct systematic attribution errors.

Authors: This observation correctly identifies a load-bearing assumption. While the subsequent temporal optimization and multi-cue association are intended to mitigate isolated grounding failures, we acknowledge that a dedicated analysis of grounding accuracy is valuable. In the revision, we have added a short error analysis paragraph in §3.2 that discusses observed grounding failure modes on the dense tabletop data and a sensitivity study in the Experiments section that quantifies how controlled degradation in 2D grounding accuracy propagates to final graph metrics. These additions make the dependence on the 2D stage explicit and demonstrate the robustness margin provided by the 3D optimization.

revision: yes

Referee: [Experiments] The temporal graph optimization integrates evidence accumulation, entropy regularization, and smoothing. Without ablations isolating the contribution of each component or comparisons to simpler baselines, it is difficult to evaluate if this formulation is necessary or superior for resolving cross-frame uncertainties.

Authors: We concur that component-wise ablations and baseline comparisons are necessary to justify the optimization design. The revised manuscript now includes a dedicated ablation table in the Experiments section that isolates evidence accumulation, entropy regularization, and temporal smoothing. We also report results against simpler baselines (per-frame majority voting and non-regularized temporal averaging). The new results show that each term contributes measurably to cross-frame consistency and that the full formulation outperforms the ablated variants on the dense scenes, thereby substantiating the necessity of the integrated approach.

revision: yes
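
For readers unfamiliar with the two simpler baselines named in this response, a minimal sketch follows. It assumes each frame provides one score per candidate parent for a given node; the function names and the example scores are illustrative, and the rebuttal does not specify how the baselines were actually implemented.

```python
from collections import Counter

def majority_vote(frame_scores):
    """Each frame votes for its top-scoring candidate; the most frequent vote wins."""
    votes = [max(range(len(scores)), key=scores.__getitem__)
             for scores in frame_scores]
    return Counter(votes).most_common(1)[0][0]

def temporal_average(frame_scores):
    """Average the raw scores over frames, then take the argmax (no regularization)."""
    k = len(frame_scores[0])
    n = len(frame_scores)
    mean = [sum(scores[j] for scores in frame_scores) / n for j in range(k)]
    return max(range(k), key=mean.__getitem__)

# Two weakly confident frames favour candidate 0; one very confident frame favours 1.
frame_scores = [[0.55, 0.45], [0.55, 0.45], [0.0, 1.0]]
print(majority_vote(frame_scores), temporal_average(frame_scores))
```

The example is chosen so that the baselines disagree: voting discards how confident each frame was, while plain averaging lets a single confident frame dominate. That tension is what a regularized temporal formulation would be expected to balance.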

Circularity Check

0 steps flagged

No significant circularity; pipeline uses external grounding and standard optimization

The paper presents an engineering pipeline that first anchors fine-grained functional edges directly from 2D visual evidence, then associates nodes across frames using multiple 3D cues, formulates edge association as temporal graph optimization incorporating evidence accumulation, entropy regularization and smoothing, and finally applies global hierarchy shaping. None of these steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the method description relies on external 2D grounding tools and conventional graph techniques whose correctness is independent of the target functional-graph outputs. The central claim of reliable inference on the new dense tabletop benchmark is evaluated via experiments rather than being presupposed by the derivation itself, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the effectiveness of 2D visual evidence for functional edges and the ability of multi-cue 3D association plus optimization to handle small-object confusion; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: 2D visual grounding supplies reliable evidence for anchoring fine-grained functional relationships in 3D.
Rationale: the pipeline explicitly anchors edges from 2D visual evidence as the starting point for relational reasoning.

pith-pipeline@v0.9.0 · 5795 in / 1331 out tokens · 65754 ms · 2026-05-20T18:53:13.776764+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "anchor fine-grained functional edges from 2D visual evidence... temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "global hierarchy shaping... O ← C ← U hierarchy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 1 internal anchor

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: ICCV (2019)
[3] Atzmon, M., Maron, H., Lipman, Y.: Point convolutional neural networks by extension operators. ACM TOG (2018)
[4] Banerjee, P., Shkodrani, S., Moulon, P., Hampali, S., Zhang, F., Fountain, J., Miller, E., Basol, S., Newcombe, R., Wang, R., et al.: Introducing hot3d: An egocentric dataset for 3d hand and object tracking. arXiv preprint arXiv:2406.09598 (2024)
[5] Bieri, V., Zamboni, M., Blumer, N.S., Chen, Q., Engelmann, F.: OpenCity3D: 3D Urban Scene Understanding with Vision-Language Models (2025)
[6] Chen, L., Wang, X., Lu, J., Lin, S., Wang, C., He, G.: Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning. In: CVPR (2024)
[7] Chen, Z., Hasson, Y., Schmid, C., Laptev, I.: Alignsdf: Pose-aligned signed distance fields for hand-object reconstruction. In: ECCV (2022)
[8] Cho, W., Lee, J., Yi, M., Kim, M., Woo, T., Kim, D., Ha, T., Lee, H., Ryu, J.H., Woo, W., et al.: Dense hand-object (ho) graspnet with full grasping taxonomy and dynamics. ECCV (2024)
[9] Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: CVPR (2019)
[10] Delitzas, A., Parelli, M., Hars, N., Vlassis, G., Anagnostidis, S.K., Bachmann, G., Hofmann, T.: Multi-clip: Contrastive vision-language pre-training for question answering tasks in 3d scenes. In: BMVC (2023)
[11] Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes. In: CVPR (2024)
[12] Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning approach for object affordance detection (2018)
[13] Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M.: 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In: CVPR (2020)
[14] Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., Tombari, F.: Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. ICLR (2024)
[15] Fan, Z., Parelli, M., Kadoglou, M.E., Chen, X., Kocabas, M., Black, M.J., Hilliges, O.: Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: CVPR (2024)
[16] Fang, K., Wu, T.L., Yang, D., Savarese, S., Lim, J.J.: Demo2vec: Reasoning object affordances from online videos. In: CVPR (2018)
[17] Fu, S., Yang, Q., Mo, Q., Yan, J., Wei, X., Meng, J., Xie, X., Zheng, W.S.: Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. In: CVPR (2025)
[18] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning (2024)
[19] Han, L., Zheng, T., Xu, L., Fang, L.: Occuseg: Occupancy-aware 3d instance segmentation. In: CVPR (2020)
[20] Hou, J., Dai, A., Nießner, M.: 3d-sis: 3d semantic instance segmentation of rgb-d scans. In: CVPR (2019)
[21] Hsu, J., Mao, J., Wu, J.: Ns3d: Neuro-symbolic grounding of 3d objects and relations. In: CVPR (2023)
[22] Hu, X., Wu, Y., Zhao, M., Cao, Z., Zhang, X., Ji, X.: Dyo-slam: Visual localization and object mapping in dynamic scenes. IEEE Transactions on Circuits and Systems for Video Technology (2025)
[23] Hu, Z., Bai, X., Shang, J., Zhang, R., Dong, J., Wang, X., Sun, G., Fu, H., Tai, C.L.: Vmnet: Voxel-mesh network for geodesic-aware 3d semantic segmentation. In: ICCV (2021)
[24] Hua, B.S., Tran, M.K., Yeung, S.K.: Pointwise convolutional neural networks. In: CVPR (2018)
[25] Huang, R., Peng, S., Takmaz, A., Tombari, F., Pollefeys, M., Song, S., Huang, G., Engelmann, F.: Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels. ECCV (2024)
[26] Huang, S., Chen, Y., Jia, J., Wang, L.: Multi-view transformer for 3d visual grounding. In: CVPR (2022)
[27] Huang, X., Huang, Y.J., Zhang, Y., Tian, W., Feng, R., Zhang, Y., Xie, Y., Li, Y., Zhang, L.: Open-set image tagging with multi-grained text supervision. In: ACM MM (2025)
[28] Jatavallabhula, K.M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., et al.: Conceptfusion: Open-set multimodal 3d mapping. ICRA2023 Workshop on Pretraining for Robotics (PT4R) (2023)
[29] Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: Pointgroup: Dual-set point grouping for 3d instance segmentation. In: CVPR (2020)
[30] Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields. In: ICCV (2023)
[31] Koch, S., Hermosilla, P., Vaskevicius, N., Colosi, M., Ropinski, T.: Lang3dsg: Language-based contrastive pre-training for 3d scene graph prediction (2024)
[32] Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: CVPR (2024)
[33] Landrieu, L., Simonovsky, M.: Large-scale point cloud semantic segmentation with superpoint graphs. In: CVPR (2018)
[34] Li, Q., Mo, K., Yang, Y., Zhao, H., Guibas, L.: IFR-Explore: Learning inter-object functional relationships in 3d indoor scenes. ICLR (2022)
[35] Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on x-transformed points. NeurIPS (2018)
[36] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)
[37] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (2024)
[38] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
[39] Nagarajan, T., Feichtenhofer, C., Grauman, K.: Grounded human-object interaction hotspots from video. In: ICCV (2019)
[40] Ok, K., Liu, K., Frey, K., How, J.P., Roy, N.: Robust object-based slam for high-speed autonomous navigation. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 669–675. IEEE (2019)
[41] OpenAI: Gpt-5 system card. Technical report, OpenAI (Aug 2025), https://cdn.openai.com/gpt-5-system-card.pdf, accessed: 2025-11-14
[42] Parelli, M., Delitzas, A., Hars, N., Vlassis, G., Anagnostidis, S., Bachmann, G., Hofmann, T.: CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes. In: CVPRW (2023)
[43] Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: CVPR (2023)
[44] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017)
[45] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS (2017)
[46] Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. In: CVPR (2024)
[47] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024)
[48] Roh, J., Desingh, K., Farhadi, A., Fox, D.: Languagerefer: Spatial-language model for 3d visual grounding (2022)
[49] Rosinol, A., Gupta, A., Abate, M., Shi, J., Carlone, L.: 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. Robotics, Science and Systems (2020)
[50] Rosinol, A., Violette, A., Abate, M., Hughes, N., Chang, Y., Shi, J., Gupta, A., Carlone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs (2021)
[51] Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction (2025)
[52] Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3d: Mask transformer for 3d semantic instance segmentation (2023)
[53] Takmaz, A., Delitzas, A., Sumner, R.W., Engelmann, F., Wald, J., Tombari, F.: Search3D: Hierarchical Open-Vocabulary 3D Segmentation (2025)
[54] Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. NeurIPS (2023)
[55] Takmaz, A., Schult, J., Kaftan, I., Akçay, M., Leibe, B., Sumner, R., Engelmann, F., Tang, S.: 3D Segmentation of Humans in Point Clouds with Synthetic Data. In: ICCV (2023)
[56] Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.: Kpconv: Flexible and deformable convolution for point clouds. In: ICCV (2019)
[57] Vu, T., Kim, K., Luu, T.M., Nguyen, T., Yoo, C.D.: Softgroup for 3d instance segmentation on point clouds. In: CVPR (2022)
[58] Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: CVPR (2020)
[59] Wang, Z., Cheng, B., Zhao, L., Xu, D., Tang, Y., Sheng, L.: Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud. In: CVPR (2023)
[60] Weder, S., Blum, H., Engelmann, F., Pollefeys, M.: Labelmaker: Automatic semantic label generation from rgb-d trajectories (2024)
[61]

Weder, S., Engelmann, F., Schönberger, J.L., Seki, A., Pollefeys, M., Oswald, M.R.: Alster: A Local Spatio-temporal Expert for Online 3D Semantic Reconstruction (2023)

work page 2023
[62]

In: CVPR (2023)

Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: CVPR (2023)

work page 2023
[63]

In: CVPR (2021)

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: SceneGraphFusion: Incre- mental 3d scene graph prediction from rgb-d sequences. In: CVPR (2021)

work page 2021
[64]

In: ICCV (2021)

Yang, Z., Zhang, S., Wang, L., Luo, J.: Sat: 2d semantics assisted training for 3d visual grounding. In: ICCV (2021)

work page 2021
[65]

In: CVPR (2024)

Ye, Y., Gupta, A., Kitani, K., Tulsiani, S.: G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis. In: CVPR (2024)

work page 2024
[66]

In: CVPR (2022)

Ye, Y., Gupta, A., Tulsiani, S.: What’s in your hands? 3d reconstruction of generic objects in hands. In: CVPR (2022)

work page 2022
[67]

arXiv preprint arXiv:2405.20141 (2024)

Yilmaz, G., Peng, S., Pollefeys, M., Engelmann, F., Blum, H.: OpenDAS: Open- Vocabulary Domain Adaptation for 2D and 3D Segmentation. arXiv preprint arXiv:2405.20141 (2024)

[68] Yoshida, T., Kurita, S., Nishimura, T., Mori, S.: Text-driven affordance learning from egocentric vision. arXiv preprint arXiv:2404.02523 (2024)

[69] Yue, Y., Mahadevan, S., Schult, J., Engelmann, F., Leibe, B., Schindler, K., Kontogianni, T.: Agile3d: Attention guided interactive multi-object 3d segmentation. ICLR (2024)

[70] Zhai, W., Luo, H., Zhang, J., Cao, Y., Tao, D.: One-shot object affordance detection in the wild. IJCV (2022)

[71] Zhang, C., Yu, J., Song, Y., Cai, W.: Exploiting edge-oriented reasoning for 3d point-based scene graph analysis. In: CVPR (2021)

[72] Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: CVPR (2025)

[73] Zhang, C., Di, Y., Zhang, R., Zhai, G., Manhardt, F., Tombari, F., Ji, X.: Ddf-ho: Hand-held object reconstruction via conditional directed distance field. NeurIPS (2023)

[74] Zhang, C., Jiao, G., Di, Y., Wang, G., Huang, Z., Zhang, R., Manhardt, F., Fu, B., Tombari, F., Ji, X.: Moho: Learning single-view hand-held object reconstruction with multi-view occlusion-aware supervision. In: CVPR (2024)

[75] Zhang, S., Hao, A., Qin, H., et al.: Knowledge-inspired 3d scene graph prediction in point cloud. NeurIPS (2021)

[76] Zhang, Y., Gong, Z., Chang, A.X.: Multi3drefer: Grounding text description to multiple 3d objects. In: ICCV (2023)

[77] Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. In: CVPR (2024)

[78] Zhou, S., Chang, H., Jiang, S., Fan, Z., Zhu, Z., Xu, D., Chari, P., You, S., Wang, Z., Kadambi, A.: Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In: CVPR (2024)

[79] Zuo, X., Samangouei, P., Zhou, Y., Di, Y., Li, M.: Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding. IJCV (2024)


[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

[2] Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: ICCV (2019)

[3] Atzmon, M., Maron, H., Lipman, Y.: Point convolutional neural networks by extension operators. ACM TOG (2018)

[4] Banerjee, P., Shkodrani, S., Moulon, P., Hampali, S., Zhang, F., Fountain, J., Miller, E., Basol, S., Newcombe, R., Wang, R., et al.: Introducing hot3d: An egocentric dataset for 3d hand and object tracking. arXiv preprint arXiv:2406.09598 (2024)

[5] Bieri, V., Zamboni, M., Blumer, N.S., Chen, Q., Engelmann, F.: OpenCity3D: 3D Urban Scene Understanding with Vision-Language Models (2025)

[6] Chen, L., Wang, X., Lu, J., Lin, S., Wang, C., He, G.: Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning. In: CVPR (2024)

[7] Chen, Z., Hasson, Y., Schmid, C., Laptev, I.: Alignsdf: Pose-aligned signed distance fields for hand-object reconstruction. In: ECCV (2022)

[8] Cho, W., Lee, J., Yi, M., Kim, M., Woo, T., Kim, D., Ha, T., Lee, H., Ryu, J.H., Woo, W., et al.: Dense hand-object (ho) graspnet with full grasping taxonomy and dynamics. ECCV (2024)

[9] Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: CVPR (2019)

[10] Delitzas, A., Parelli, M., Hars, N., Vlassis, G., Anagnostidis, S.K., Bachmann, G., Hofmann, T.: Multi-clip: Contrastive vision-language pre-training for question answering tasks in 3d scenes. In: BMVC (2023)

[11] Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes. In: CVPR (2024)

[12] Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning approach for object affordance detection (2018)

[13] Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M.: 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In: CVPR (2020)

[14] Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., Tombari, F.: Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. ICLR (2024)

[15] Fan, Z., Parelli, M., Kadoglou, M.E., Chen, X., Kocabas, M., Black, M.J., Hilliges, O.: Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: CVPR (2024)

[16] Fang, K., Wu, T.L., Yang, D., Savarese, S., Lim, J.J.: Demo2vec: Reasoning object affordances from online videos. In: CVPR (2018)

[17] Fu, S., Yang, Q., Mo, Q., Yan, J., Wei, X., Meng, J., Xie, X., Zheng, W.S.: Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. In: CVPR (2025)

[18] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning (2024)

[19] Han, L., Zheng, T., Xu, L., Fang, L.: Occuseg: Occupancy-aware 3d instance segmentation. In: CVPR (2020)

[20] Hou, J., Dai, A., Nießner, M.: 3d-sis: 3d semantic instance segmentation of rgb-d scans. In: CVPR (2019)

[21] Hsu, J., Mao, J., Wu, J.: Ns3d: Neuro-symbolic grounding of 3d objects and relations. In: CVPR (2023)

[22] Hu, X., Wu, Y., Zhao, M., Cao, Z., Zhang, X., Ji, X.: Dyo-slam: Visual localization and object mapping in dynamic scenes. IEEE Transactions on Circuits and Systems for Video Technology (2025)

[23] Hu, Z., Bai, X., Shang, J., Zhang, R., Dong, J., Wang, X., Sun, G., Fu, H., Tai, C.L.: Vmnet: Voxel-mesh network for geodesic-aware 3d semantic segmentation. In: ICCV (2021)

[24] Hua, B.S., Tran, M.K., Yeung, S.K.: Pointwise convolutional neural networks. In: CVPR (2018)

[25] Huang, R., Peng, S., Takmaz, A., Tombari, F., Pollefeys, M., Song, S., Huang, G., Engelmann, F.: Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels. ECCV (2024)

[26] Huang, S., Chen, Y., Jia, J., Wang, L.: Multi-view transformer for 3d visual grounding. In: CVPR (2022)

[27] Huang, X., Huang, Y.J., Zhang, Y., Tian, W., Feng, R., Zhang, Y., Xie, Y., Li, Y., Zhang, L.: Open-set image tagging with multi-grained text supervision. In: ACM MM (2025)

[28] Jatavallabhula, K.M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., et al.: Conceptfusion: Open-set multimodal 3d mapping. ICRA2023 Workshop on Pretraining for Robotics (PT4R) (2023)

[29] Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: Pointgroup: Dual-set point grouping for 3d instance segmentation. In: CVPR (2020)

[30] Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields. In: ICCV (2023)

[31] Koch, S., Hermosilla, P., Vaskevicius, N., Colosi, M., Ropinski, T.: Lang3dsg: Language-based contrastive pre-training for 3d scene graph prediction (2024)

[32] Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: CVPR (2024)

[33] Landrieu, L., Simonovsky, M.: Large-scale point cloud semantic segmentation with superpoint graphs. In: CVPR (2018)

[34] Li, Q., Mo, K., Yang, Y., Zhao, H., Guibas, L.: IFR-Explore: Learning inter-object functional relationships in 3d indoor scenes. ICLR (2022)

[35] Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on x-transformed points. NeurIPS (2018)

[36] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)

[37] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (2024)

[38] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

[39] Nagarajan, T., Feichtenhofer, C., Grauman, K.: Grounded human-object interaction hotspots from video. In: ICCV (2019)

[40] Ok, K., Liu, K., Frey, K., How, J.P., Roy, N.: Robust object-based slam for high-speed autonomous navigation. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 669–675. IEEE (2019)

[41] OpenAI: Gpt-5 system card. Technical report, OpenAI (Aug 2025), https://cdn.openai.com/gpt-5-system-card.pdf, accessed: 2025-11-14

[42] Parelli, M., Delitzas, A., Hars, N., Vlassis, G., Anagnostidis, S., Bachmann, G., Hofmann, T.: CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes. In: CVPRW (2023)

[43] Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: CVPR (2023)

[44] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017)

[45] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS (2017)

[46] Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. In: CVPR (2024)

[47] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024)

[48] Roh, J., Desingh, K., Farhadi, A., Fox, D.: Languagerefer: Spatial-language model for 3d visual grounding (2022)

[49] Rosinol, A., Gupta, A., Abate, M., Shi, J., Carlone, L.: 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. Robotics, Science and Systems (2020)

[50] Rosinol, A., Violette, A., Abate, M., Hughes, N., Chang, Y., Shi, J., Gupta, A., Carlone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs (2021)

[51] Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction (2025)

[52] Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3d: Mask transformer for 3d semantic instance segmentation (2023)

[53] Takmaz, A., Delitzas, A., Sumner, R.W., Engelmann, F., Wald, J., Tombari, F.: Search3D: Hierarchical Open-Vocabulary 3D Segmentation (2025)

[54] Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. NeurIPS (2023)

[55] Takmaz, A., Schult, J., Kaftan, I., Akçay, M., Leibe, B., Sumner, R., Engelmann, F., Tang, S.: 3D Segmentation of Humans in Point Clouds with Synthetic Data. In: ICCV (2023)
