NoPA: Non-Parametric Online 3D Scene Graph Generation

Gim Hee Lee; Qi Xun Yeo; Seungjun Lee; Yan Li

arxiv: 2607.00529 · v1 · pith:VOHQ7IUOnew · submitted 2026-07-01 · 💻 cs.CV

Qi Xun Yeo, Seungjun Lee, Yan Li, Gim Hee Lee

Pith reviewed 2026-07-02 14:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene graph · non-parametric distribution · online inference · maximum mean discrepancy · kernel density estimate · object merging · real-time robotics

The pith

NoPA represents each object as a non-parametric distribution to retain geometric detail in real-time 3D scene graph generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Classic 3D scene graph methods struggle with real-time performance because of expensive point-cloud processing. A prior approach approximated objects with single 3D Gaussians to gain speed but lost geometric detail and merged objects poorly. NoPA instead models each object as its own non-parametric distribution kept as a fixed set of particles. This keeps the detail while a merging step based on maximum mean discrepancy between kernel density estimates allows reliable online combination of observations. The system also fixes relation errors by propagating links between similar objects and runs at real-time speeds with better accuracy than prior work.

Core claim

The paper claims that representing each object as a separate non-parametric distribution retains 3D geometric information while preserving real-time inference of the parametric Gaussian formulation. To build upon this, a tailored merging strategy leverages maximum mean discrepancy on kernel density estimates to enable robust merging of object candidates during online exploration while minimizing added computational complexity by maintaining a fixed particle set per object. Furthermore, relationships between objects with high affinity are propagated to rectify the relation loss caused by misclassified objects.

What carries the argument

Non-parametric distribution for each object maintained via a fixed particle set, with merging performed using maximum mean discrepancy on kernel density estimates.
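As a rough illustration of the quantity involved, the sketch below computes a biased estimate of squared maximum mean discrepancy between two 3D particle sets using a Gaussian (RBF) kernel. It is a minimal sketch under stated assumptions, not the paper's implementation: the kernel choice, the `bandwidth` value, the particle counts, and all function names are hypothetical, and the paper's exact MMD-on-KDE formulation may differ.

```python
import math
import random

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two 3D points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=0.1):
    """Biased (V-statistic) estimate of squared MMD between two particle sets."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

def sample_blob(center, spread, n, rng):
    """Draw n points from an isotropic Gaussian blob (toy stand-in for an object)."""
    return [tuple(c + rng.gauss(0.0, spread) for c in center) for _ in range(n)]

rng = random.Random(0)
obj_a = sample_blob((0.0, 0.0, 0.0), 0.05, 64, rng)        # first view of object A
obj_a_again = sample_blob((0.0, 0.0, 0.0), 0.05, 64, rng)  # second view of object A
obj_b = sample_blob((1.0, 0.0, 0.0), 0.05, 64, rng)        # a different object

mmd_same = mmd_squared(obj_a, obj_a_again)  # small: same underlying distribution
mmd_diff = mmd_squared(obj_a, obj_b)        # large: well-separated distributions
```

A merge rule built on this would compare the estimate against a threshold; the threshold itself is not given in the material above.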

If this is right

Retains 3D geometric detail compared to single Gaussian approximations.
Enables robust merging of object candidates in online settings.
Preserves real-time inference speed.
Rectifies relation losses through affinity-based propagation.
Outperforms current methods in 3D scene graph generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This representation could be applied to other real-time mapping problems where balancing detail and speed is critical.
The fixed particle approach might integrate with existing particle-based filters in robotics.
Improved scene graphs could lead to better performance in robotic navigation and interaction tasks.

Load-bearing premise

A fixed particle set per object can retain sufficient geometric detail for accurate merging and relation propagation without increasing computational cost beyond real-time limits.
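To make the premise concrete, here is a minimal sketch of one way a per-object particle budget could be enforced when a new observation is merged: pool the two sets, then subsample uniformly back to the budget. The uniform-subsampling rule, the `budget` value, and the function name are assumptions for illustration; the summary above does not say how the paper maintains its fixed set.

```python
import random

def merge_with_budget(global_particles, local_particles, budget, rng):
    """Pool two particle sets, then subsample uniformly so at most `budget` remain."""
    pooled = list(global_particles) + list(local_particles)
    if len(pooled) <= budget:
        return pooled
    return rng.sample(pooled, budget)

rng = random.Random(0)
budget = 128  # hypothetical per-object particle count

# Start from a full global set, then fold in ten new observations of 40 particles each.
global_set = [(rng.random(), rng.random(), rng.random()) for _ in range(budget)]
for _ in range(10):
    local_set = [(rng.random(), rng.random(), rng.random()) for _ in range(40)]
    global_set = merge_with_budget(global_set, local_set, budget, rng)
```

The set size never exceeds `budget`, so any per-object computation stays bounded however many frames arrive. One caveat of this particular rule is that older particles are thinned at every merge, so recent frames end up over-represented; weighting all frames equally would need a reservoir-style scheme.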

What would settle it

Running the method on a dataset with complex object geometries and measuring whether merging accuracy drops or inference time exceeds real-time thresholds compared to Gaussian baselines.

Figures

Figures reproduced from arXiv: 2607.00529 by Gim Hee Lee, Qi Xun Yeo, Seungjun Lee, Yan Li.

**Figure 2.** Overview of our online 3D scene graph generation pipeline. (1) A pretrained RT-DETR-EGTR [13,45] model predicts a local 2D scene graph from each RGB frame. (2) For every object node, we sample pixels inside its 2D bounding box, back-project them with depth, and obtain a 3D particle set in the world frame. (3) We associate each local particle set with existing global objects using a two-stage test: a consta…
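Step (2) of the Figure 2 caption can be sketched as follows, assuming a standard pinhole camera model and a known camera-to-world pose. The intrinsics (`fx`, `fy`, `cx`, `cy`), the uniform pixel sampling, and every function name here are hypothetical choices for illustration; the paper's sampling scheme may differ.

```python
import random

def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def to_world(p_cam, rotation, translation):
    """Apply a camera-to-world rigid transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[r][c] * p_cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def sample_particles(bbox, depth_map, intrinsics, rotation, translation, n, rng):
    """Sample up to n pixels inside a 2D box and back-project them to world space."""
    u_min, v_min, u_max, v_max = bbox
    fx, fy, cx, cy = intrinsics
    particles = []
    for _ in range(n):
        u = rng.randint(u_min, u_max)  # inclusive bounds
        v = rng.randint(v_min, v_max)
        z = depth_map[v][u]
        if z <= 0.0:
            continue  # skip pixels with missing depth
        p_cam = back_project(u, v, z, fx, fy, cx, cy)
        particles.append(to_world(p_cam, rotation, translation))
    return particles

# Toy input: a 4x4 depth image of a flat surface 2 m away, identity camera pose.
depth_map = [[2.0] * 4 for _ in range(4)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
particles = sample_particles((0, 0, 3, 3), depth_map, (500.0, 500.0, 2.0, 2.0),
                             identity, (0.0, 0.0, 0.0), 32, random.Random(0))
```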

**Figure 3.** Visualization of objects in Scene 41385849 from the 3DSSG dataset. In the top row, we visualize an instance of the sink class localized by a red bounding box. In the bottom row, we visualize the global 3D object instances. Left: Visualized Gaussian blobs from FROSS [11]. The Gaussian blob only encompasses half of the sink in (a). Spurious blobs spanning across the scene in (c) visualize the impact of …

**Figure 4.** Visualization of the merging process for an object in Scene 7272e16c from the 3DSSG dataset. If the local particle set and the global particle set yield a small Hellinger distance (d_H < \delta_H - \epsilon) after fitting a unimodal Gaussian (Stage 1), their covariances clearly match, and the merge decision is straightforward. If the Hellinger distance falls within the margin band (Stage 2), covaria…

**Figure 5.** We compare the qualitative results between FROSS and our proposed approach for scene 321c867e from the 3DSSG dataset. The scene shows a kitchen from bird’s eye view (BEV). FROSS fails to predict a majority of the wall background class. As a consequence, a majority of the predicate relationships are lost. Our method correctly classifies most objects, except for the counter instance, while correctly predicti…

**Figure 6.** We compare the qualitative results between FROSS and our proposed approach for scene ab835fae from the 3DSSG dataset. The object instances denoted with a * are not visible from either viewpoint angle but are visible in the input images. FROSS fails to predict a majority of the wall background class. Notably, FROSS has trouble differentiating the wall class from the sink class. FROSS also fails to predict …

**Figure 7.** We compare the qualitative results between FROSS and our proposed approach for scene c2d9933f from the 3DSSG dataset. FROSS once again fails to predict a majority of the wall background class. FROSS also misclassifies the sofa instance as a chair instance. Even though FROSS correctly classifies most objects, FROSS fails to predict the predicate relationships between most objects. Our method correctly class…

**Figure 8.** We compare the qualitative results between FROSS and our proposed approach for scene 5630cfe7 from the 3DSSG dataset. FROSS misclassifies the desk object as a table class. FROSS also misclassifies the wall instance as an other furniture instance. Because of the initial incorrect classification, all the relationships that are predicted with the wall instance are misclassified or missing. Our method correctl…

**Figure 9.** We show the distribution of merge decisions in the test split of the ReplicaSSG dataset according to the Hellinger distance calculated from fitting the Gaussian distribution on NoPA’s particles. Even in the narrow margin band \delta_H - \epsilon \leq d_H \leq \delta_H + \epsilon, where \delta_H = 0.85 and \epsilon = 0.05, there exists a substantial number of merge decisions that require sensitive…
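The Stage 1 screen described in the captions for Figures 4 and 9 can be sketched with the closed-form Hellinger distance between two Gaussians. The values \delta_H = 0.85 and \epsilon = 0.05 come from the Figure 9 caption. Everything else is an assumption for illustration: the function names, the 3x3 list-of-lists covariance layout, and in particular the "reject" outcome above the margin band, which the truncated captions do not confirm.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, b):
    """Solve m x = b for a 3x3 system using Cramer's rule."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution

def hellinger_gaussian(mu1, cov1, mu2, cov2):
    """Hellinger distance in [0, 1] between N(mu1, cov1) and N(mu2, cov2)."""
    avg = [[0.5 * (cov1[r][c] + cov2[r][c]) for c in range(3)] for r in range(3)]
    diff = [a - b for a, b in zip(mu1, mu2)]
    maha = sum(d * s for d, s in zip(diff, solve3(avg, diff)))
    coeff = (det3(cov1) * det3(cov2)) ** 0.25 / math.sqrt(det3(avg))
    bhattacharyya = coeff * math.exp(-maha / 8.0)
    return math.sqrt(max(0.0, 1.0 - bhattacharyya))

def stage1_decision(d_h, delta_h=0.85, eps=0.05):
    """Clear cases are decided here; the margin band is deferred to Stage 2."""
    if d_h < delta_h - eps:
        return "merge"
    if d_h > delta_h + eps:
        return "reject"  # assumed outcome, not stated in the captions
    return "stage2"

cov = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
d_same = hellinger_gaussian((0.0, 0.0, 0.0), cov, (0.0, 0.0, 0.0), cov)
d_far = hellinger_gaussian((0.0, 0.0, 0.0), cov, (5.0, 0.0, 0.0), cov)
```

Identical Gaussians give a distance of essentially zero and widely separated ones approach one, so only pairs near the threshold need the finer Stage 2 test, which this sketch does not implement.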

Original abstract

Classic 3D scene graph generation approaches fail to work in real-time due to the heavy computational cost of environment mapping and the need to generate intermediate point-cloud representations. To alleviate this issue, a recent work eschews point clouds in favor of a lightweight Gaussian distribution for each object. This approximation drastically speeds up inference and enables real-time 3D scene graph generation. However, the representation has two key weaknesses. 1) Each object is approximated by a single 3D Gaussian, which causes a severe loss of 3D geometric detail. 2) The discrepancy between this approximation and the true object geometry exacerbates the inaccurate merging of object candidates during online inference. To address these issues, we propose NoPA, which represents each object as a separate non-parametric distribution. This formulation retains 3D geometric information while preserving real-time inference of the parametric Gaussian formulation. To build upon our novel object representation, we propose a tailored merging strategy to recover coherent object instances. Specifically, we leverage maximum mean discrepancy on kernel density estimates to enable robust merging of object candidates during online exploration while minimizing added computational complexity. The key is to maintain a fixed particle set per object. Furthermore, to rectify the relation loss caused by misclassified objects, NoPA propagates relationships between objects with high affinity. Experiments show that NoPA substantially outperforms current methods without sacrificing real-time inference speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NoPA swaps single Gaussians for fixed-particle non-parametric distributions plus MMD merging to fix detail loss and bad online merges in real-time scene graphs, but the abstract gives zero numbers so the performance claims stay uncheckable.

The main thing here is a shift from the recent Gaussian-per-object baseline to a non-parametric representation that keeps a fixed set of particles per object. They then merge candidates with maximum mean discrepancy on kernel density estimates instead of whatever the Gaussian version used. That plus a relationship propagation step for misclassified objects is the concrete addition.

It does a clean job naming the two weaknesses in the Gaussian approach: obvious loss of geometric detail from the single-ellipse fit, and the knock-on effect on merging accuracy during online exploration. The fixed-particle trick is presented as the way to keep the speed while recovering some of that detail, and the MMD-on-KDE step is meant to make merging more robust without blowing up compute.

The soft spot is obvious and central: the abstract asserts substantial outperformance at real-time speeds but supplies no tables, no baselines, no error bars, and no timing numbers. Without those, the claim that the fixed particle set simultaneously retains detail, enables reliable MMD merging, and stays fast cannot be evaluated. The assumption looks internally consistent on paper, but it is exactly the kind of assumption that needs numbers to check.

This is aimed at the robotics and AR crowd that already cares about online 3D scene graphs. A reader already working on the Gaussian version or similar lightweight representations would get the most out of it, mainly to see whether the non-parametric route actually delivers on the merging and detail promises.

If the full paper contains proper experiments, ablations, and timing data, it deserves a serious referee. Right now the abstract alone does not give enough to decide.

Referee Report

2 major / 1 minor

Summary. The paper proposes NoPA for online 3D scene graph generation. It replaces single-Gaussian object representations with non-parametric distributions maintained via a fixed particle set per object, introduces a merging strategy based on maximum mean discrepancy computed on kernel density estimates, and adds relationship propagation to correct for misclassified objects. The abstract asserts that this retains geometric detail, enables robust online merging, and yields substantial outperformance over prior methods while preserving real-time inference speed.

Significance. If the performance and runtime claims are substantiated by rigorous experiments, the work would address a key limitation in real-time 3D scene understanding by improving geometric fidelity without incurring the cost of full point-cloud mapping, with relevance to robotics and augmented reality applications.

major comments (2)

[Abstract] Abstract: the assertion that 'Experiments show that NoPA substantially outperforms current methods without sacrificing real-time inference speed' is unsupported by any quantitative results, baselines, metrics, error bars, or experimental details, so the central performance claim cannot be evaluated.
[Abstract] Abstract (paragraph on tailored merging strategy): the claim that a fixed particle set simultaneously retains geometric detail, supports robust MMD-on-KDE merging, and preserves real-time speed without new approximation errors or computational blowup is stated without any complexity analysis, particle-count justification, or empirical validation of the assumption.

minor comments (1)

[Abstract] The method description remains at a high level; concrete details on particle initialization, kernel selection, MMD implementation, and the exact affinity measure used for relationship propagation would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We address each major comment point by point below and propose revisions where the abstract can be strengthened without altering the manuscript's core contributions.

Referee: [Abstract] Abstract: the assertion that 'Experiments show that NoPA substantially outperforms current methods without sacrificing real-time inference speed' is unsupported by any quantitative results, baselines, metrics, error bars, or experimental details, so the central performance claim cannot be evaluated.

Authors: The referee is correct that the abstract itself contains no numerical results. The full experimental section provides quantitative comparisons against baselines on standard metrics, runtime measurements confirming real-time performance, and supporting details. To make the central claim evaluable from the abstract alone, we will revise the abstract to incorporate key quantitative highlights (e.g., mAP improvements and FPS) drawn directly from the experiments. revision: yes

Referee: [Abstract] Abstract (paragraph on tailored merging strategy): the claim that a fixed particle set simultaneously retains geometric detail, supports robust MMD-on-KDE merging, and preserves real-time speed without new approximation errors or computational blowup is stated without any complexity analysis, particle-count justification, or empirical validation of the assumption.

Authors: The abstract summarizes the design choice; the complexity analysis (O(1) per merge due to fixed particle count), justification for the chosen particle number, and empirical validation against full point-cloud and Gaussian baselines appear in the Method and Experiments sections. We agree the abstract would be clearer with a brief reference to these elements and will revise it to include a short statement on the fixed particle count and its complexity implications. revision: yes
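To make the complexity remark concrete, one standard (biased) estimator of squared MMD between two particle sets $X = \{x_i\}_{i=1}^{K}$ and $Y = \{y_j\}_{j=1}^{K}$ with kernel $k$ is written out below. The notation, and the assumption that both sets hold exactly $K$ particles, are ours for illustration and are not taken from the paper.

```latex
\[
\widehat{\mathrm{MMD}}^2(X, Y)
  = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} k(x_i, x_j)
  + \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} k(y_i, y_j)
  - \frac{2}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} k(x_i, y_j)
\]
```

Each double sum has $K^2$ terms, so one merge test costs on the order of $3K^2$ kernel evaluations. With $K$ fixed, that cost does not grow with the number of frames that have observed the object, which is the sense in which a per-merge cost can be called constant.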

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes a methodological contribution for real-time 3D scene graph generation by replacing single-Gaussian object representations with non-parametric particle-based distributions and introducing an MMD-on-KDE merging strategy. No equations, derivations, or first-principles results are present that reduce any claimed prediction or uniqueness to a fitted parameter or self-citation chain defined by the authors' own prior work. The abstract and description explicitly contrast the new approach against an external recent work on Gaussians, and the core claims (retained geometric detail, robust merging, real-time speed) are presented as engineering choices rather than mathematical reductions. This is the common case of an applied CV method whose validity rests on empirical results rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, implementation details, or parameter lists; cannot identify free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

45 extracted references · 17 canonical work pages

[1] Armeni, I., He, Z.Y., Gwak, J., Zamir, A.R., Fischer, M., Malik, J., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5664–5673 (2019)

[2] Buechner, M., Roefer, A., Engelbracht, T., Welschehold, T., Bauer, Z., Blum, H., Pollefeys, M., Valada, A.: Articulated 3d scene graphs for open-world mobile manipulation. arXiv preprint arXiv:2602.16356 (2026)

[3] Çelen, A., Han, G., Schindler, K., Gool, L.V., Armeni, I., Obukhov, A., Wang, X.: I-design: Personalized LLM interior designer. In: Bue, A.D., Canton, C., Pont-Tuset, J., Tommasi, T. (eds.) Computer Vision - ECCV 2024 Workshops - Milan, Italy, September 29-October 4, 2024, Proceedings, Part II. Lecture Notes in Computer Science, vol. 15624, pp. 217–234. Springer (2024...

[4] Chang, Y., Ballotta, L., Carlone, L.: D-lite: Navigation-oriented compression of 3d scene graphs for multi-robot collaboration. IEEE Robotics Autom. Lett. 8(11), 7527–7534 (2023). https://doi.org/10.1109/LRA.2023.3320011

[5] Dhamo, H., Manhardt, F., Navab, N., Tombari, F.: Graph-to-3d: End-to-end generation and manipulation of 3d scenes using scene graphs. In: IEEE International Conference on Computer Vision (ICCV) (2021)

[6] Feng, M., Hou, H., Zhang, L., Wu, Z., Guo, Y., Mian, A.: 3d spatial multimodal knowledge accumulation for scene graph prediction in point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9182–9191 (2023)

[7] Fischer, T., Porzi, L., Bulò, S.R., Pollefeys, M., Kontschieder, P.: Multi-level neural scene graphs for dynamic urban environments. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 21125–21135. IEEE (2024). https://doi.org/10.1109/CVPR52733.2024.01996

[8] Greve, E., Büchner, M., Vödisch, N., Burgard, W., Valada, A.: Collaborative dynamic 3d scene graphs for automated driving. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 11118–11124 (2024). https://doi.org/10.1109/ICRA57147.2024.10610112

[9] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

[10] Guo, D., Lin, M., Pei, J., Tang, H., Jin, Y., Heng, P.A.: Tri-modal confluence with temporal dynamics for scene graph generation in operating rooms. In: MICCAI. Springer (2024)

[11] Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2025)

[12] Huang, X., Zhao, S., Wang, Y., Lu, X., Zhang, W., Qu, R., Li, W., Wang, Y., Wen, C.: Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation (2026), https://arxiv.org/abs/2511.10376

[13] Im, J., Nam, J., Park, N., Lee, H., Park, S.: Egtr: Extracting graph from transformer for scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24229–24238 (June 2024)

[14] Kim, U.H., Park, J.M., Song, T.J., Kim, J.H.: 3d-scene-graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE Cybernetics (2019)

[15] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)

[16] Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

[17] Koch, S., Wald, J., Colosi, M., Vaskevicius, N., Hermosilla, P., Tombari, F., Ropinski, T.: Relationfield: Relate anything in radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

[18] Lee, S., Lee, G.H.: Diet-gs: Diffusion prior and event stream-assisted motion deblurring 3d gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21739–21749 (2025)

[19] Lee, S., Lee, G.H.: Segment any events with language. arXiv preprint arXiv:2601.23159 (2026)

[20] Lee, S., Zhao, Y., Lee, G.H.: Segment any 3d object with language. arXiv preprint arXiv:2404.02157 (2024)

[21] Lin, C., Mu, Y.: Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. In: International Conference on Learning Representations (ICLR) (2024)

[22] Liu, Y., Li, X., Zhang, Y., Qi, L., Li, X., Wang, W., Li, C., Li, X., Yang, M.H.: Controllable 3d outdoor scene generation via scene graphs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 28052–28062 (October 2025)

[23] Nyffeler, J., Tombari, F., Barath, D.: Hierarchical 3d scene graphs construction outdoors. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 26817–26826 (October 2025)

[24] Özsoy, E., Czempiel, T., Holm, F., Pellegrini, C., Navab, N.: LABRAD-OR: lightweight memory scene graphs for accurate bimodal reasoning in dynamic operating rooms. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S.E., Duncan, J., Syeda-Mahmood, T.F., Taylor, R.H. (eds.) Medical Image Computing and Computer Assisted Intervention - MICCAI 202... doi:10.1007/978-3-031-43996-4_29

[25] Özsoy, E., Örnek, E., Eck, U., Czempiel, T., Tombari, F., Navab, N.: 4d-or: Semantic scene graphs for or domain modeling. In: Wang, L., Dou, Q., Fletcher, P., Speidel, S., Li, S. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022 - 25th International Conference, Proceedings. pp. 475–485. Lecture Notes in Computer Science (... doi:10.1007/978-3-031-16449-1_45

[26] Seiwald, P., Wu, S.C., Sygulla, F., Berninger, T.F.C., Staufenberg, N.S., Sattler, M.F., Neuburger, N., Rixen, D., Tombari, F.: Lola v1.1 – an upgrade in hardware and software design for dynamic multi-contact locomotion. In: 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids). IEEE (2021). https://doi.org/10.1109/humanoids47582.2021.9555790

[27] Sonogashira, M., Iiyama, M., Kawanishi, Y.: Towards open-set scene graph generation with unknown objects. IEEE Access 10, 11574–11583 (2022). https://doi.org/10.1109/ACCESS.2022.3145465

[28] Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3961–3970 (2020)

[29] Wald, J., Navab, N., Tombari, F.: Learning 3d semantic scene graphs with instance embeddings. International Journal of Computer Vision 130(3), 630–651 (2022)

[30] Wang, Z., Lee, S., Dai, G., Lee, G.H.: D3d-vlp: Dynamic 3d vision-language-planning model for embodied grounding and navigation. arXiv preprint arXiv:2512.12622 (2025)

[31] Wang, Z., Lee, S., Lee, G.H.: Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and-language navigation. arXiv preprint arXiv:2505.11383 (2025)

[32] Wang, Z., Cheng, B., Zhao, L., Xu, D., Tang, Y., Sheng, L.: Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21560–21569 (2023)

[33] Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. Robotics: Science and Systems (2024)

[34] Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5064–5074 (2023)

[35] Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7515–7525 (2021)

[36] Yang, Z., Lu, K., Zhang, C., Qi, J., Jiang, H., Ma, R., Yin, S., Xu, Y., Xing, M., Xiao, Z., et al.: Mmgdreamer: Mixed-modality graph for geometry-controllable 3d indoor scene generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9391–9399 (2025)

[37] Yeo, Q.X., Li, Y., Lee, G.H.: Statistical confidence rescoring for robust 3d scene graph generation from multi-view images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 24999–25008 (October 2025)

[38] Yin, H., Wei, H., Xu, X., Guo, W., Zhou, J., Lu, J.: Gc-vln: Instruction as graph constraints for training-free vision-and-language navigation. arXiv preprint arXiv:2509.10454 (2025)

[39] Yin, H., Xu, X., Wu, Z., Zhou, J., Lu, J.: SG-nav: Online 3d scene graph prompting for LLM-based zero-shot object navigation. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=HmCmxbCpp2

[40] Zhai, G., Örnek, E.P., Chen, D.Z., Liao, R., Di, Y., Navab, N., Tombari, F., Busam, B.: Echoscene: Indoor scene generation via information echo over scene graph diffusion. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXI. pp. 167–184. Springer-Verlag, Berlin, Heidelberg (2024).h...

[41] Zhai, G., Örnek, E.P., Wu, S.C., Di, Y., Tombari, F., Navab, N., Busam, B.: Commonscenes: Generating commonsense 3d indoor scenes with scene graphs. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=1SF2tiopYJ

[42] Zhang, C., Yu, J., Song, Y., Cai, W.: Exploiting edge-oriented reasoning for 3d point-based scene graph analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9705–9715 (2021)

[43] Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

[44] Zhang, Y., Qian, D., Li, D., Pan, Y., Chen, Y., Liang, Z., Zhang, Z., Liu, Y., Mei, J., Fu, M., Ye, Y., Liang, Z., Shan, Y., Du, D.: Graphad: Interaction scene graph for end-to-end autonomous driving. In: Kwok, J. (ed.) Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25. pp. 2422–2430. International Joint ... doi:10.24963/ijcai.2025/270

[45] Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16965–16974 (June 2024)


work page doi:10.1109/cvpr52733 2024

[8] [8]

In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

Greve, E., Büchner, M., Vödisch, N., Burgard, W., Valada, A.: Collaborative dy- namic 3d scene graphs for automated driving pp. 11118–11124 (2024).https: //doi.org/10.1109/ICRA57147.2024.10610112

work page doi:10.1109/icra57147.2024.10610112 2024

[9] [9]

In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

2024

[10] [10]

In: MICCAI

Guo, D., Lin, M., Pei, J., Tang, H., Jin, Y., Heng, P.A.: Tri-modal confluence with temporal dynamics for scene graph generation in operating rooms. In: MICCAI. Springer (2024) 16 Yeo et al

2024

[11] [11]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2025)

Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: FROSS: Faster-than-Real- Time Online 3D Semantic Scene Graph Generation from RGB-D Images. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2025)

2025

[12] [12]

Huang, X., Zhao, S., Wang, Y., Lu, X., Zhang, W., Qu, R., Li, W., Wang, Y., Wen, C.: Msgnav: Unleashing the power of multi-modal 3d scene graph for zero- shot embodied navigation (2026),https://arxiv.org/abs/2511.10376

work page arXiv 2026

[13] [13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Im, J., Nam, J., Park, N., Lee, H., Park, S.: Egtr: Extracting graph from trans- former for scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24229–24238 (June 2024)

2024

[14] [14]

Kim, U.H., Park, J.M., Song, T.J., Kim, J.H.: 3d-scene-graph: A sparse and seman- ticrepresentationofphysicalenvironmentsforintelligentagents.IEEECybernetics (2019)

2019

[15] [15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)

2023

[16] [16]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

2024

[17] [17]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Koch, S., Wald, J., Colosi, M., Vaskevicius, N., Hermosilla, P., Tombari, F., Ropin- ski, T.: Relationfield: Relate anything in radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

2025

[18] [18]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lee, S., Lee, G.H.: Diet-gs: Diffusion prior and event stream-assisted motion de- blurring 3d gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21739–21749 (2025)

2025

[19] [19]

arXiv preprint arXiv:2601.23159 (2026)

Lee, S., Lee, G.H.: Segment any events with language. arXiv preprint arXiv:2601.23159 (2026)

work page arXiv 2026

[20] [20]

arXiv preprint arXiv:2404.02157 (2024)

Lee, S., Zhao, Y., Lee, G.H.: Segment any 3d object with language. arXiv preprint arXiv:2404.02157 (2024)

work page arXiv 2024

[21] [21]

In: International Conference on Learning Representations (ICLR) (2024)

Lin, C., Mu, Y.: Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. In: International Conference on Learning Representations (ICLR) (2024)

2024

[22] [22]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, Y., Li, X., Zhang, Y., Qi, L., Li, X., Wang, W., Li, C., Li, X., Yang, M.H.: Controllable 3d outdoor scene generation via scene graphs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 28052– 28062 (October 2025)

2025

[23] [23]

Nyffeler, J., Tombari, F., Barath, D.: Hierarchical 3d scene graphs construction outdoors.In:ProceedingsoftheIEEE/CVFInternationalConferenceonComputer Vision (ICCV). pp. 26817–26826 (October 2025)

2025

[24] [24]

In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S.E., Duncan, J., Syeda-Mahmood, T.F., Taylor, R.H

Özsoy, E., Czempiel, T., Holm, F., Pellegrini, C., Navab, N.: LABRAD-OR: lightweight memory scene graphs for accurate bimodal reasoning in dynamic op- erating rooms. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S.E., Duncan, J., Syeda-Mahmood, T.F., Taylor, R.H. (eds.) Medical Image Comput- ing and Computer Assisted Intervention - MICCAI 202...

work page doi:10.1007/978-3-031-43996-4_29 2023

[25] [25]

In: Wang, L., Dou, Q., Fletcher, P., Spei- del, S., Li, S

Özsoy, E., Örnek, E., Eck, U., Czempiel, T., Tombari, F., Navab, N.: 4d-or: Seman- tic scene graphs for or domain modeling. In: Wang, L., Dou, Q., Fletcher, P., Spei- del, S., Li, S. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022 - 25th International Conference, Proceedings. pp. 475–485. Lecture Notes in Computer Science (...

work page doi:10.1007/978-3-031-16449-1_45 2022

[26] [26]

In: 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids)

Seiwald, P., Wu, S.C., Sygulla, F., Berninger, T.F.C., Staufenberg, N.S., Sattler, M.F., Neuburger, N., Rixen, D., Tombari, F.: Lola v1.1 – an upgrade in hardware and software design for dynamic multi-contact locomotion. In: 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids). IEEE (2021). https://doi.org/10.1109/humanoids47582.2021.9555790

work page doi:10.1109/humanoids47582.2021.9555790 2020

[27] [27]

IEEE Access10, 11574–11583 (2022).https: //doi.org/10.1109/ACCESS.2022.3145465

Sonogashira, M., Iiyama, M., Kawanishi, Y.: Towards open-set scene graph gen- eration with unknown objects. IEEE Access10, 11574–11583 (2022).https: //doi.org/10.1109/ACCESS.2022.3145465

work page doi:10.1109/access.2022.3145465 2022

[28] [28]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3961–3970 (2020)

2020

[29] [29]

International Journal of Computer Vision130(3), 630–651 (2022)

Wald, J., Navab, N., Tombari, F.: Learning 3d semantic scene graphs with instance embeddings. International Journal of Computer Vision130(3), 630–651 (2022)

2022

[30] [30]

arXiv preprint arXiv:2512.12622 (2025)

Wang, Z., Lee, S., Dai, G., Lee, G.H.: D3d-vlp: Dynamic 3d vision-language- planning model for embodied grounding and navigation. arXiv preprint arXiv:2512.12622 (2025)

work page arXiv 2025

[31] [31]

arXiv preprint arXiv:2505.11383 (2025)

Wang, Z., Lee, S., Lee, G.H.: Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and-language navigation. arXiv preprint arXiv:2505.11383 (2025)

work page arXiv 2025

[32] [32]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, Z., Cheng, B., Zhao, L., Xu, D., Tang, Y., Sheng, L.: Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21560–21569 (2023)

2023

[33] [33]

Robotics: Science and Systems (2024)

Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. Robotics: Science and Systems (2024)

2024

[34] [34]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5064–5074 (2023)

2023

[35] [35]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: In- cremental 3d scene graph prediction from rgb-d sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7515– 7525 (2021)

2021

[36] [36]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Yang, Z., Lu, K., Zhang, C., Qi, J., Jiang, H., Ma, R., Yin, S., Xu, Y., Xing, M., Xiao, Z., et al.: Mmgdreamer: Mixed-modality graph for geometry-controllable 3d indoor scene generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9391–9399 (2025)

2025

[37] [37]

In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV)

Yeo, Q.X., Li, Y., Lee, G.H.: Statistical confidence rescoring for robust 3d scene graph generation from multi-view images. In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV). pp. 24999–25008 (October 2025)

2025

[38] [38]

Gc-vln: Instruction as graph constraints for training-free vision-and-language navigation.arXiv preprint arXiv:2509.10454, 2025a

Yin, H., Wei, H., Xu, X., Guo, W., Zhou, J., Lu, J.: Gc-vln: Instruction as graph constraints for training-free vision-and-language navigation. arXiv preprint arXiv:2509.10454 (2025) 18 Yeo et al

work page arXiv 2025

[39] [39]

In: The Thirty-eighth Annual Confer- ence on Neural Information Processing Systems (2024),https://openreview.net/ forum?id=HmCmxbCpp2

Yin, H., Xu, X., Wu, Z., Zhou, J., Lu, J.: SG-nav: Online 3d scene graph prompting for LLM-based zero-shot object navigation. In: The Thirty-eighth Annual Confer- ence on Neural Information Processing Systems (2024),https://openreview.net/ forum?id=HmCmxbCpp2

2024

[40] [40]

In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXI

Zhai, G., Örnek, E.P., Chen, D.Z., Liao, R., Di, Y., Navab, N., Tombari, F., Busam, B.: Echoscene: Indoor scene generation via information echo over scene graph dif- fusion. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXI. p. 167–184. Springer- Verlag, Berlin, Heidelberg (2024).h...

work page doi:10.1007/978-3-031-72664- 2024

[41] [41]

In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=1SF2tiopYJ

Zhai, G., Örnek, E.P., Wu, S.C., Di, Y., Tombari, F., Navab, N., Busam, B.: Commonscenes: Generating commonsense 3d indoor scenes with scene graphs. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=1SF2tiopYJ

2023

[42] [42]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhang, C., Yu, J., Song, Y., Cai, W.: Exploiting edge-oriented reasoning for 3d point-based scene graph analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9705–9715 (2021)

2021

[43] [43]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

2025

[44] [44]

In: Kwok, J

Zhang, Y., Qian, D., Li, D., Pan, Y., Chen, Y., Liang, Z., Zhang, Z., Liu, Y., Mei, J., Fu, M., Ye, Y., Liang, Z., Shan, Y., Du, D.: Graphad: Interaction scene graph for end-to-end autonomous driving. In: Kwok, J. (ed.) Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25. pp. 2422–2430. International Joint ...

work page doi:10.24963/ijcai.2025/270 2025

[45] [45]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16965–16974 (June 2024) NoPA: Non-Parametric Online 3D Scene Graph Generation 19 NoPA: Non-Parametric Online 3D Scene Graph Generation Su...

2024