DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors

Abdelrahman Elskhawy; Benjamin Busam; Dooyoung Kim; Eunjae Shin; Seok-Young Kim; Taewook Ha; Woontack Woo

arxiv: 2607.00889 · v1 · pith:NU75VJJRnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI

DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors

Seok-Young Kim , Abdelrahman Elskhawy , Taewook Ha , Dooyoung Kim , Eunjae Shin , Benjamin Busam , Woontack Woo This is my paper

Pith reviewed 2026-07-02 14:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords 3D semantic scene graphsRGB-D sequencesGaussian distributionsdepth-guided filteringworld model priorsspatiotemporal aggregationscene graph generation

0 comments

The pith

DeWorldSG builds 3D scene graphs by turning depth data into probabilistic Gaussian object nodes and refining relations with world-model priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that processes RGB-D video to create 3D semantic scene graphs with better stability over time. Objects are represented as 3D Gaussian distributions filtered by depth rather than single points to reduce instability from frame to frame. Relations between objects are completed by collecting evidence across multiple frames and adjusting with priors from a world model. The result is higher accuracy in identifying objects and their connections on standard test sets.

Core claim

DeWorldSG estimates instance-level geometric 3D Gaussian distributions through depth-guided filtering and represents each object as a probabilistic 3D node rather than a single projected point. It aggregates spatiotemporal evidence across object pairs and refines relations using contextual priors derived from a world model. Experiments on the 3DSSG and ReplicaSSG datasets demonstrate state-of-the-art performance in both object and predicate prediction while producing temporally consistent scene structures, with triplet recall improved by 77.4 percent and predicate recall by 23.2 percent over prior approaches.

What carries the argument

Instance-level 3D Gaussian distributions as probabilistic object nodes, paired with spatiotemporal evidence aggregation across frames and world-model contextual priors for relation refinement.

If this is right

Scene graphs become temporally consistent across video frames instead of fluctuating per frame.
Object and relation predictions improve together on RGB-D benchmarks.
The graphs become usable for tasks that require stable 3D spatial understanding over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Gaussian node representation could be tested directly on consumer depth cameras with lower precision to check robustness.
World-model priors might transfer to other incomplete 3D observation settings such as monocular video with estimated depth.
Longer video sequences could expose whether the aggregation step saturates or continues to add value.

Load-bearing premise

Depth sensor readings are accurate and complete enough to form reliable 3D Gaussian distributions for each object instance without major noise or occlusion problems.

What would settle it

Measure triplet and predicate recall on the same datasets after adding realistic depth sensor noise or heavy occlusions to the input sequences; if gains over baselines disappear the central claim does not hold.

Figures

Figures reproduced from arXiv: 2607.00889 by Abdelrahman Elskhawy, Benjamin Busam, Dooyoung Kim, Eunjae Shin, Seok-Young Kim, Taewook Ha, Woontack Woo.

**Figure 1.** Figure 1: DeWorldSG generates a spatio-temporally robust 3D semantic scene graph from RGB-D observations. It combines 2D scene graph generation, depth-aware 3D lifting, and world-model-based relation enhancement to produce stable object nodes and context-aware relational edges for AR and embodied robotic systems. this need by transforming complex 3D environments into a compact graph structure consisting of object-l… view at source ↗

**Figure 2.** Figure 2: Overview of DeWorldSG. Given RGB-D image sequences, DeWorldSG first constructs frame-wise 2D scene graphs by detecting objects and extracting their relationships (a). The detected instances are lifted to depth-aware probabilistic 3D Gaussians through mask-guided depth filtering and dual-domain depth refinement (b). Finally, spatiotemporal predicate priors from a video world model are aggregated and fuse… view at source ↗

**Figure 3.** Figure 3: For each object pair, we extract a 16-frame union-crop clip and encode it with a frozen V-JEPA 2 backbone. The resulting spatiotemporal tokens are mean-pooled and passed to an MLP probe to produce a clip-level predicate distribution. This distance is small when two nodes assign high probability mass to the same class. We merge nodes only when both geometric and semantic criteria are satisfied: HD(i,j) < \… view at source ↗

**Figure 5.** Figure 5: 3D SSG results under the same camera viewpoint without (left) and with (right) DualDomain Refinement (DR). The red rectangle highlights the unstable object position [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of 3D SSG generation results on ReplicaSSG. (Zoom for details) ated or object locations are inaccurately projected, leading to overly dense or incorrectly connected relation edges. Additional qualitative comparisons are provided in the Supplementary Material. 5.3 Ablation Study We analyze DeWorldSG’s key components through four configurations as presented in Tab. 3. The baseline mo… view at source ↗

**Figure 6.** Figure 6: Failure case caused by inconsistent 2D object classification across frames. DeWorldSG relies on a 2D object detector for object class prediction, which can propagate frame-level recognition errors into the constructed 3D scene graph. As shown in [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

read the original abstract

We present DeWorldSG, a novel framework that generates spatio-temporally robust 3D Semantic Scene Graphs from RGB-D sequences. Existing methods often struggle to construct reliable 3D scene graphs due to unstable 3D object representations and missing relations caused by frame-wise inference. DeWorldSG addresses these issues by estimating instance-level geometric 3D Gaussian distributions through depth-guided filtering and representing each object as a probabilistic 3D node rather than a single projected point. To mitigate relational sparsity from frame-wise inference, our framework further aggregates spatiotemporal evidence across object pairs and refines relations using contextual priors derived from a world model (V-JEPA 2). Experiments on the 3DSSG and ReplicaSSG datasets demonstrate state-of-the-art (SoTA) performance in both object and predicate prediction, while producing temporally consistent scene structures. In particular, our method improves triplet recall by 77.4% and predicate recall by 23.2% over prior SoTA approaches, making it suitable for robotic manipulation and AR applications. Our code and models are open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DeWorldSG uses depth-guided 3D Gaussians for object nodes and V-JEPA 2 priors for relations, but the large reported gains rest on untested depth stability.

read the letter

The paper's main move is to replace point-based object nodes with instance-level 3D Gaussian distributions filtered from depth, then aggregate spatiotemporal evidence across frames and refine predicates with contextual priors from V-JEPA 2. This targets the usual problems of unstable geometry and missing relations in frame-wise 3D scene graph work.

The technical choices are concrete: depth-guided filtering to build the Gaussians, probabilistic nodes instead of projected points, and the world-model step for relation aggregation. They evaluate on 3DSSG and ReplicaSSG, claim state-of-the-art numbers, and release code and models. That last part makes the work easier to check.

The gains they list—77.4% triplet recall and 23.2% predicate recall—are big enough to notice. If the experiments hold up with proper controls and ablations, the approach could help downstream tasks in robotics and AR that need consistent 3D graphs.

The soft spot is exactly where the stress-test flags it. The whole pipeline starts with depth data producing stable covariance parameters for each object. Typical RGB-D issues like holes, specular noise, and partial occlusions are common, yet the description gives no noise model, no depth completion step, and no ablation on degraded depth. Spatiotemporal aggregation and V-JEPA 2 priors sit downstream and cannot correct upstream geometric error. Without those checks, the performance numbers are hard to trust outside clean lab sequences. Domain shift for the world-model priors is also unaddressed in the available text.

This is aimed at people already working on 3D semantic mapping and scene graphs. It has enough specific machinery and open resources to deserve referee time, even if the depth assumptions will need tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DeWorldSG for generating spatio-temporally robust 3D semantic scene graphs from RGB-D sequences. It estimates instance-level geometric 3D Gaussian distributions via depth-guided filtering to represent each object as a probabilistic 3D node (rather than a projected point), aggregates spatiotemporal evidence across object pairs to address relational sparsity, and refines relations using contextual priors from the V-JEPA 2 world model. Experiments on the 3DSSG and ReplicaSSG datasets claim state-of-the-art results, including 77.4% improvement in triplet recall and 23.2% in predicate recall over prior approaches, with temporally consistent structures; code and models are open-sourced.

Significance. If the central performance claims are substantiated, the probabilistic 3D node representation and world-model priors could meaningfully improve robustness and temporal consistency in 3D scene graphs, with direct relevance to robotic manipulation and AR. The open-sourcing of code strengthens reproducibility.

major comments (2)

[Method (depth-guided filtering and probabilistic node construction)] The method's core step of estimating instance-level 3D Gaussian distributions through depth-guided filtering lacks any explicit noise model, depth completion procedure, or ablation on realistic RGB-D artifacts (holes, specular reflections, partial occlusions). This directly undermines the stability of the covariance parameters and the claim that probabilistic nodes are more reliable than point projections.
[Experiments and results] The reported gains (77.4% triplet recall, 23.2% predicate recall) are presented without accompanying experimental controls, error bars, dataset splits, or ablation studies on the depth component. Because the spatiotemporal aggregation and V-JEPA 2 priors operate downstream, they cannot compensate for upstream geometric errors; this renders the quantitative claims unevaluable from the given text.

minor comments (1)

[Abstract] The abstract states improvements 'over prior SoTA approaches' without naming the specific baselines; adding this would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological rigor and experimental transparency. We will revise the manuscript to incorporate explicit noise modeling, artifact ablations, error bars, and dataset details as outlined below.

read point-by-point responses

Referee: [Method (depth-guided filtering and probabilistic node construction)] The method's core step of estimating instance-level 3D Gaussian distributions through depth-guided filtering lacks any explicit noise model, depth completion procedure, or ablation on realistic RGB-D artifacts (holes, specular reflections, partial occlusions). This directly undermines the stability of the covariance parameters and the claim that probabilistic nodes are more reliable than point projections.

Authors: We agree that the current description does not include an explicit sensor noise model or targeted ablations on common RGB-D artifacts. In the revision we will add a depth noise model (Gaussian perturbation scaled by distance and sensor variance) and a new ablation section that injects synthetic holes, specular noise, and partial occlusions into the input depth maps, reporting the resulting covariance stability and downstream triplet recall. This will directly support the claim that probabilistic nodes improve reliability over point projections. revision: yes
Referee: [Experiments and results] The reported gains (77.4% triplet recall, 23.2% predicate recall) are presented without accompanying experimental controls, error bars, dataset splits, or ablation studies on the depth component. Because the spatiotemporal aggregation and V-JEPA 2 priors operate downstream, they cannot compensate for upstream geometric errors; this renders the quantitative claims unevaluable from the given text.

Authors: We acknowledge the absence of error bars, explicit split descriptions, and isolated depth-component ablations. The revised manuscript will report mean and standard deviation over three random seeds, specify the exact 3DSSG and ReplicaSSG train/validation/test partitions used, and add an ablation that replaces the 3D Gaussian nodes with point projections while keeping all other modules fixed. These additions will allow readers to evaluate the upstream geometric contribution independently of the downstream aggregation and world-model priors. revision: yes

Circularity Check

0 steps flagged

No circularity detected; framework description contains no equations or derivations

full rationale

The provided abstract and manuscript summary describe a pipeline that estimates 3D Gaussian distributions via depth-guided filtering, aggregates spatiotemporal evidence, and refines relations with V-JEPA 2 priors, then reports empirical gains on 3DSSG and ReplicaSSG. No equations, parameter fits, or derivation steps appear in the text. Consequently none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be exhibited by direct quotation. The central claims remain externally falsifiable through the stated dataset experiments and are not shown to reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete and based on surface claims. The framework rests on the effectiveness of depth filtering for Gaussians and the transferability of V-JEPA 2 priors.

axioms (2)

domain assumption V-JEPA 2 supplies contextual priors that improve relation prediction on the target datasets
Invoked to refine relations after spatiotemporal aggregation.
domain assumption Depth data from RGB-D sensors is accurate enough to support instance-level 3D Gaussian estimation
Central to the object representation step.

pith-pipeline@v0.9.1-grok · 5751 in / 1418 out tokens · 23410 ms · 2026-07-02T14:04:22.591296+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 17 canonical work pages · 5 internal anchors

[1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: V-jepa: Latent video prediction for visual representation learning (2023) 16 SY.Kim et al

2023
[3]

arXiv preprint arXiv:2410.23968 (2024)

Booker, M., Byrd, G., Kemp, B., Schmidt, A., Rivera, C.: Embodiedrag: Dy- namic 3d scene graph retrieval for efficient and scalable robot task planning. arXiv preprint arXiv:2410.23968 (2024)

work page arXiv 2024
[4]

Articulated 3D scene graphs for open-world mobile manipulation.arXiv preprint arXiv:2602.16356, 2026

Büchner, M., Röfer, A., Engelbracht, T., Welschehold, T., Bauer, Z., Blum, H., Pollefeys, M., Valada, A.: Articulated 3d scene graphs for open-world mobile ma- nipulation. arXiv preprint arXiv:2602.16356 (2026)

work page arXiv 2026
[5]

IEEE transactions on robotics37(6), 1874–1890 (2021)

Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics37(6), 1874–1890 (2021)

2021
[6]

Vl- jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

Chen, D., Shukor, M., Moutakanni, T., Chung, W., Yu, J., Kasarla, T., Bolourchi, A., LeCun, Y., Fung, P.: Vl-jepa: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942 (2025)

work page arXiv 2025
[7]

In: 14th International Conference on Information Fusion

Crouse, D.F., Willett, P., Pattipati, K., Svensson, L.: A look at gaussian mixture reduction algorithms. In: 14th International Conference on Information Fusion. pp. 1–8. IEEE (2011)

2011
[8]

arXiv preprint arXiv:2504.00844 (2025)

Elskhawy, A., Li, M., Navab, N., Busam, B.: Prism-0: A predicate-rich scene graph generation framework for zero-shot open-vocabulary tasks. arXiv preprint arXiv:2504.00844 (2025)

work page arXiv 2025
[9]

In: 2025IEEE-RAS24thInternationalConferenceonHumanoidRobots(Humanoids)

Engelbracht, T., Zurbrügg, R., Pollefeys, M., Blum, H., Bauer, Z.: Spotlight: Robotic scene understanding through interaction and affordance detection. In: 2025IEEE-RAS24thInternationalConferenceonHumanoidRobots(Humanoids). pp. 1–8. IEEE (2025)

2025
[10]

In: Asian Conference on Computer Vision

Gay, P., Stuart, J., Del Bue, A.: Visual graphs from motion (vgfm): Scene un- derstanding with object geometry reasoning. In: Asian Conference on Computer Vision. pp. 330–346. Springer (2018)

2018
[11]

In: International conference on machine learning

Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: International conference on machine learning. pp. 1263–1272. Pmlr (2017)

2017
[12]

In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

2024
[13]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: Fross: Faster-than-real-time online 3d semantic scene graph generation from rgb-d images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 28818–28827 (2025)

2025
[14]

arXiv preprint arXiv:2506.05217 (2025)

Hu, W., Wen, X., Li, X., Wang, G.: Dsg-world: Learning a 3d gaussian world model from dual state videos. arXiv preprint arXiv:2506.05217 (2025)

work page arXiv 2025
[15]

Advances in Neural In- formation Processing Systems (2025)

Huang, Z., Wu, X., Zhong, F., Zhao, H., Nießner, M., Lasenby, J.: Litereality: graphics-ready 3d scene reconstruction from rgb-d scans. Advances in Neural In- formation Processing Systems (2025)

2025
[16]

Proceedings of Robotics: Science and Systems (2022)

Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. Proceedings of Robotics: Science and Systems (2022)

2022
[17]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Im, J., Nam, J., Park, N., Lee, H., Park, S.: Egtr: Extracting graph from trans- former for scene graph generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24229–24238 (2024)

2024
[18]

Dexterous world models.arXiv preprint arXiv:2512.17907, 2025

Kim, B., Kim, T., Lee, J., Joo, H.: Dexterous world models. arXiv preprint arXiv:2512.17907 (2025) DeWorldSG 17

work page arXiv 2025
[19]

arXiv preprint arXiv:2602.02974 (2026)

Kim, S.Y., Kim, D., Cho, W., Song, H., Kang, S., Woo, W.: Scenelinker: Compo- sitional 3d scene generation via semantic scene graph from rgb sequences. arXiv preprint arXiv:2602.02974 (2026)

work page arXiv 2026
[20]

In: Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology

Kim, S., Kim, D., Son, T., Woo, W.: Realitycrafter: User-guided editable 3d scene generation from a single image in mixed reality. In: Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. pp. 1– 3 (2025)

2025
[21]

Kim, U.H.,Park, J.M.,Song,T.J.,Kim, J.H.:3-dscenegraph: Asparseandseman- ticrepresentationofphysicalenvironmentsforintelligentagents.IEEEtransactions on cybernetics50(12), 4921–4933 (2019)

2019
[22]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[23]

International journal of computer vision123(1), 32–73 (2017)

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision123(1), 32–73 (2017)

2017
[24]

2, 2022-06-27

LeCun, Y., et al.: A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review62(1), 1–62 (2022)

2022
[25]

In: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology

Lee, J., Aleotti, F., Mazala, D., Garcia-Hernando, G., Vicente, S., Johnston, O.J., Kraus-Liang, I., Powierza, J., Shin, D., Froehlich, J.E., et al.: Imaginatear: Ai- assisted in-situ authoring in augmented reality. In: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. pp. 1–21 (2025)

2025
[26]

IEEE Transactions on Visualization and Computer Graphics (2025)

Liu, J., Zhang, R., Butaslac, I., Sawabe, T., Fujimoto, Y., Kanbara, M., Kato, H.: Everywherear: A visual authoring system for creating adaptive ar game scenes. IEEE Transactions on Visualization and Computer Graphics (2025)

2025
[27]

arXiv:2407.17140 [cs.CV] https://arxiv.org/abs/2407.17140 Paul C

Lv, W., Zhao, Y., Chang, Q., Huang, K., Wang, G., Liu, Y.: Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140 (2024)

work page arXiv 2024
[28]

Robotic Intelligence and Automation43(1), 12–22 (2023)

Miao, R., Jia, Q., Sun, F.: Long-term robot manipulation task planning with scene graph and semantic knowledge. Robotic Intelligence and Automation43(1), 12–22 (2023)

2023
[29]

IEEE Transactions on Image Processing33, 671–681 (2024)

Murrugarra-Llerena, J., Kirsten, L.N., Zeni, L.F., Jung, C.R.: Probabilistic intersection-over-union for training and evaluation of oriented object detectors. IEEE Transactions on Image Processing33, 671–681 (2024)

2024
[30]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

International Journal of Computer Assisted Radiology and Surgery19(5), 791–799 (2024)

Özsoy, E., Czempiel, T., Örnek, E.P., Eck, U., Tombari, F., Navab, N.: Holistic or domain modeling: a semantic scene graph approach. International Journal of Computer Assisted Radiology and Surgery19(5), 791–799 (2024)

2024
[32]

arXiv preprint arXiv:2411.10509 (2024)

Pham,Q.P.,Nguyen,K.T.,Ngo,L.C.,Do,T.,Song,D.,Hy,T.S.:Tesgnn:Temporal equivariantscenegraphneuralnetworksforefficientandrobustmulti-view3dscene understanding. arXiv preprint arXiv:2411.10509 (2024)

work page arXiv 2024
[33]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

The International Journal of Robotics Research40(12-14), 1510–1546 (2021) 18 SY.Kim et al

Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40(12-14), 1510–1546 (2021) 18 SY.Kim et al

2021
[35]

arXiv preprint arXiv:2412.14480 (2024)

Saxena, S., Buchanan, B., Paxton, C., Liu, P., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., Kroemer, O.: Grapheqa: Using 3d semantic scene graphs for real- time embodied question answering. arXiv preprint arXiv:2412.14480 (2024)

work page arXiv 2024
[36]

In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

Singh, K.P., Salvador, J., Weihs, L., Kembhavi, A.: Scene graph contrastive learn- ing for embodied navigation. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 10884–10894 (2023)

2023
[37]

IEEE Transactions on Knowledge and Data Engineering36(11), 6962–6976 (2024)

Song, Y., Sun, P., Liu, H., Li, Z., Song, W., Xiao, Y., Zhou, X.: Scene-driven multimodal knowledge graph construction for embodied ai. IEEE Transactions on Knowledge and Data Engineering36(11), 6962–6976 (2024)

2024
[38]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906
[39]

In: 2020 IEEE Inter- nationalSymposiumonMixedandAugmentedRealityAdjunct(ISMAR-Adjunct)

Tahara, T., Seno, T., Narita, G., Ishikawa, T.: Retargetable ar: Context-aware augmented reality in indoor scenes based on 3d scene graph. In: 2020 IEEE Inter- nationalSymposiumonMixedandAugmentedRealityAdjunct(ISMAR-Adjunct). pp. 249–255. IEEE (2020)

2020
[40]

In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Tateno, K., Tombari, F., Navab, N.: Real-time and scalable incremental segmen- tation on dense slam. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4465–4472. IEEE (2015)

2015
[41]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wald, J., Avetisyan, A., Navab, N., Tombari, F., Nießner, M.: Rio: 3d object instance re-localization in changing indoor environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7658–7667 (2019)

2019
[42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3961–3970 (2020)

2020
[43]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, Z., Cheng, B., Zhao, L., Xu, D., Tang, Y., Sheng, L.: Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21560–21569 (2023)

2023
[44]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5064–5074 (2023)

2023
[45]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: In- cremental 3d scene graph prediction from rgb-d sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7515– 7525 (2021)

2021
[46]

In: Proceedings of the IEEE conference on computer vision and pat- tern recognition

Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative mes- sage passing. In: Proceedings of the IEEE conference on computer vision and pat- tern recognition. pp. 5410–5419 (2017)

2017
[47]

IEEE Robotics and Automation Letters (2025)

Yan, Z., Li, S., Wang, Z., Wu, L., Wang, H., Zhu, J., Chen, L., Liu, J.: Dynamic open-vocabulary 3d scene graphs for long-term language-guided mobile manipula- tion. IEEE Robotics and Automation Letters (2025)

2025
[48]

arXiv preprint arXiv:2507.12508 (2025)

Yang, Y., Liu, J., Zhang, Z., Zhou, S., Tan, R., Yang, J., Du, Y., Gan, C.: Mind- journey: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508 (2025)

work page arXiv 2025
[49]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Yang, Z., Lu, K., Zhang, C., Qi, J., Jiang, H., Ma, R., Yin, S., Xu, Y., Xing, M., Xiao, Z., et al.: Mmgdreamer: Mixed-modality graph for geometry-controllable 3d indoor scene generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9391–9399 (2025) DeWorldSG 19

2025
[50]

In: European Conference on Computer Vision

Zhai, G., Örnek, E.P., Chen, D.Z., Liao, R., Di, Y., Navab, N., Tombari, F., Busam, B.: Echoscene: Indoor scene generation via information echo over scene graph diffu- sion. In: European Conference on Computer Vision. pp. 167–184. Springer (2024)

2024
[51]

Advances in Neural Information Processing Systems36, 30026–30038 (2023)

Zhai, G., Örnek, E.P., Wu, S.C., Di, Y., Tombari, F., Navab, N., Busam, B.: Com- monscenes: Generating commonsense 3d indoor scenes with scene graph diffusion. Advances in Neural Information Processing Systems36, 30026–30038 (2023)

2023
[52]

arXiv preprint arXiv:2510.23607 (2025)

Zhang, Y., Wu, X., Lao, Y., Wang, C., Tian, Z., Wang, N., Zhao, H.: Concerto: Joint 2d-3d self-supervised learning emerges spatial representations. arXiv preprint arXiv:2510.23607 (2025)

work page arXiv 2025
[53]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16965–16974 (2024)

2024

[1] [1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: V-jepa: Latent video prediction for visual representation learning (2023) 16 SY.Kim et al

2023

[3] [3]

arXiv preprint arXiv:2410.23968 (2024)

Booker, M., Byrd, G., Kemp, B., Schmidt, A., Rivera, C.: Embodiedrag: Dy- namic 3d scene graph retrieval for efficient and scalable robot task planning. arXiv preprint arXiv:2410.23968 (2024)

work page arXiv 2024

[4] [4]

Articulated 3D scene graphs for open-world mobile manipulation.arXiv preprint arXiv:2602.16356, 2026

Büchner, M., Röfer, A., Engelbracht, T., Welschehold, T., Bauer, Z., Blum, H., Pollefeys, M., Valada, A.: Articulated 3d scene graphs for open-world mobile ma- nipulation. arXiv preprint arXiv:2602.16356 (2026)

work page arXiv 2026

[5] [5]

IEEE transactions on robotics37(6), 1874–1890 (2021)

Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics37(6), 1874–1890 (2021)

2021

[6] [6]

Vl- jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

Chen, D., Shukor, M., Moutakanni, T., Chung, W., Yu, J., Kasarla, T., Bolourchi, A., LeCun, Y., Fung, P.: Vl-jepa: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942 (2025)

work page arXiv 2025

[7] [7]

In: 14th International Conference on Information Fusion

Crouse, D.F., Willett, P., Pattipati, K., Svensson, L.: A look at gaussian mixture reduction algorithms. In: 14th International Conference on Information Fusion. pp. 1–8. IEEE (2011)

2011

[8] [8]

arXiv preprint arXiv:2504.00844 (2025)

Elskhawy, A., Li, M., Navab, N., Busam, B.: Prism-0: A predicate-rich scene graph generation framework for zero-shot open-vocabulary tasks. arXiv preprint arXiv:2504.00844 (2025)

work page arXiv 2025

[9] [9]

In: 2025IEEE-RAS24thInternationalConferenceonHumanoidRobots(Humanoids)

Engelbracht, T., Zurbrügg, R., Pollefeys, M., Blum, H., Bauer, Z.: Spotlight: Robotic scene understanding through interaction and affordance detection. In: 2025IEEE-RAS24thInternationalConferenceonHumanoidRobots(Humanoids). pp. 1–8. IEEE (2025)

2025

[10] [10]

In: Asian Conference on Computer Vision

Gay, P., Stuart, J., Del Bue, A.: Visual graphs from motion (vgfm): Scene un- derstanding with object geometry reasoning. In: Asian Conference on Computer Vision. pp. 330–346. Springer (2018)

2018

[11] [11]

In: International conference on machine learning

Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: International conference on machine learning. pp. 1263–1272. Pmlr (2017)

2017

[12] [12]

In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

2024

[13] [13]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Hou, H.Y., Lee, C.Y., Sonogashira, M., Kawanishi, Y.: Fross: Faster-than-real-time online 3d semantic scene graph generation from rgb-d images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 28818–28827 (2025)

2025

[14] [14]

arXiv preprint arXiv:2506.05217 (2025)

Hu, W., Wen, X., Li, X., Wang, G.: Dsg-world: Learning a 3d gaussian world model from dual state videos. arXiv preprint arXiv:2506.05217 (2025)

work page arXiv 2025

[15] [15]

Advances in Neural In- formation Processing Systems (2025)

Huang, Z., Wu, X., Zhong, F., Zhao, H., Nießner, M., Lasenby, J.: Litereality: graphics-ready 3d scene reconstruction from rgb-d scans. Advances in Neural In- formation Processing Systems (2025)

2025

[16] [16]

Proceedings of Robotics: Science and Systems (2022)

Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. Proceedings of Robotics: Science and Systems (2022)

2022

[17] [17]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Im, J., Nam, J., Park, N., Lee, H., Park, S.: Egtr: Extracting graph from trans- former for scene graph generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24229–24238 (2024)

2024

[18] [18]

Dexterous world models.arXiv preprint arXiv:2512.17907, 2025

Kim, B., Kim, T., Lee, J., Joo, H.: Dexterous world models. arXiv preprint arXiv:2512.17907 (2025) DeWorldSG 17

work page arXiv 2025

[19] [19]

arXiv preprint arXiv:2602.02974 (2026)

Kim, S.Y., Kim, D., Cho, W., Song, H., Kang, S., Woo, W.: Scenelinker: Compo- sitional 3d scene generation via semantic scene graph from rgb sequences. arXiv preprint arXiv:2602.02974 (2026)

work page arXiv 2026

[20] [20]

In: Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology

Kim, S., Kim, D., Son, T., Woo, W.: Realitycrafter: User-guided editable 3d scene generation from a single image in mixed reality. In: Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. pp. 1– 3 (2025)

2025

[21] [21]

Kim, U.H.,Park, J.M.,Song,T.J.,Kim, J.H.:3-dscenegraph: Asparseandseman- ticrepresentationofphysicalenvironmentsforintelligentagents.IEEEtransactions on cybernetics50(12), 4921–4933 (2019)

2019

[22] [22]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

International journal of computer vision123(1), 32–73 (2017)

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision123(1), 32–73 (2017)

2017

[24] [24]

2, 2022-06-27

LeCun, Y., et al.: A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review62(1), 1–62 (2022)

2022

[25] [25]

In: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology

Lee, J., Aleotti, F., Mazala, D., Garcia-Hernando, G., Vicente, S., Johnston, O.J., Kraus-Liang, I., Powierza, J., Shin, D., Froehlich, J.E., et al.: Imaginatear: Ai- assisted in-situ authoring in augmented reality. In: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. pp. 1–21 (2025)

2025

[26] [26]

IEEE Transactions on Visualization and Computer Graphics (2025)

Liu, J., Zhang, R., Butaslac, I., Sawabe, T., Fujimoto, Y., Kanbara, M., Kato, H.: Everywherear: A visual authoring system for creating adaptive ar game scenes. IEEE Transactions on Visualization and Computer Graphics (2025)

2025

[27] [27]

arXiv:2407.17140 [cs.CV] https://arxiv.org/abs/2407.17140 Paul C

Lv, W., Zhao, Y., Chang, Q., Huang, K., Wang, G., Liu, Y.: Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140 (2024)

work page arXiv 2024

[28] [28]

Robotic Intelligence and Automation43(1), 12–22 (2023)

Miao, R., Jia, Q., Sun, F.: Long-term robot manipulation task planning with scene graph and semantic knowledge. Robotic Intelligence and Automation43(1), 12–22 (2023)

2023

[29] [29]

IEEE Transactions on Image Processing33, 671–681 (2024)

Murrugarra-Llerena, J., Kirsten, L.N., Zeni, L.F., Jung, C.R.: Probabilistic intersection-over-union for training and evaluation of oriented object detectors. IEEE Transactions on Image Processing33, 671–681 (2024)

2024

[30] [30]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

International Journal of Computer Assisted Radiology and Surgery19(5), 791–799 (2024)

Özsoy, E., Czempiel, T., Örnek, E.P., Eck, U., Tombari, F., Navab, N.: Holistic or domain modeling: a semantic scene graph approach. International Journal of Computer Assisted Radiology and Surgery19(5), 791–799 (2024)

2024

[32] [32]

arXiv preprint arXiv:2411.10509 (2024)

Pham,Q.P.,Nguyen,K.T.,Ngo,L.C.,Do,T.,Song,D.,Hy,T.S.:Tesgnn:Temporal equivariantscenegraphneuralnetworksforefficientandrobustmulti-view3dscene understanding. arXiv preprint arXiv:2411.10509 (2024)

work page arXiv 2024

[33] [33]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

The International Journal of Robotics Research40(12-14), 1510–1546 (2021) 18 SY.Kim et al

Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40(12-14), 1510–1546 (2021) 18 SY.Kim et al

2021

[35] [35]

arXiv preprint arXiv:2412.14480 (2024)

Saxena, S., Buchanan, B., Paxton, C., Liu, P., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., Kroemer, O.: Grapheqa: Using 3d semantic scene graphs for real- time embodied question answering. arXiv preprint arXiv:2412.14480 (2024)

work page arXiv 2024

[36] [36]

In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

Singh, K.P., Salvador, J., Weihs, L., Kembhavi, A.: Scene graph contrastive learn- ing for embodied navigation. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 10884–10894 (2023)

2023

[37] [37]

IEEE Transactions on Knowledge and Data Engineering36(11), 6962–6976 (2024)

Song, Y., Sun, P., Liu, H., Li, Z., Song, W., Xiao, Y., Zhou, X.: Scene-driven multimodal knowledge graph construction for embodied ai. IEEE Transactions on Knowledge and Data Engineering36(11), 6962–6976 (2024)

2024

[38] [38]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906

[39] [39]

In: 2020 IEEE Inter- nationalSymposiumonMixedandAugmentedRealityAdjunct(ISMAR-Adjunct)

Tahara, T., Seno, T., Narita, G., Ishikawa, T.: Retargetable ar: Context-aware augmented reality in indoor scenes based on 3d scene graph. In: 2020 IEEE Inter- nationalSymposiumonMixedandAugmentedRealityAdjunct(ISMAR-Adjunct). pp. 249–255. IEEE (2020)

2020

[40] [40]

In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Tateno, K., Tombari, F., Navab, N.: Real-time and scalable incremental segmen- tation on dense slam. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4465–4472. IEEE (2015)

2015

[41] [41]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wald, J., Avetisyan, A., Navab, N., Tombari, F., Nießner, M.: Rio: 3d object instance re-localization in changing indoor environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7658–7667 (2019)

2019

[42] [42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3961–3970 (2020)

2020

[43] [43]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, Z., Cheng, B., Zhao, L., Xu, D., Tang, Y., Sheng, L.: Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21560–21569 (2023)

2023

[44] [44]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wu, S.C., Tateno, K., Navab, N., Tombari, F.: Incremental 3d semantic scene graph prediction from rgb sequences. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5064–5074 (2023)

2023

[45] [45]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: In- cremental 3d scene graph prediction from rgb-d sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7515– 7525 (2021)

2021

[46] [46]

In: Proceedings of the IEEE conference on computer vision and pat- tern recognition

Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative mes- sage passing. In: Proceedings of the IEEE conference on computer vision and pat- tern recognition. pp. 5410–5419 (2017)

2017

[47] [47]

IEEE Robotics and Automation Letters (2025)

Yan, Z., Li, S., Wang, Z., Wu, L., Wang, H., Zhu, J., Chen, L., Liu, J.: Dynamic open-vocabulary 3d scene graphs for long-term language-guided mobile manipula- tion. IEEE Robotics and Automation Letters (2025)

2025

[48] [48]

arXiv preprint arXiv:2507.12508 (2025)

Yang, Y., Liu, J., Zhang, Z., Zhou, S., Tan, R., Yang, J., Du, Y., Gan, C.: Mind- journey: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508 (2025)

work page arXiv 2025

[49] [49]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Yang, Z., Lu, K., Zhang, C., Qi, J., Jiang, H., Ma, R., Yin, S., Xu, Y., Xing, M., Xiao, Z., et al.: Mmgdreamer: Mixed-modality graph for geometry-controllable 3d indoor scene generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9391–9399 (2025) DeWorldSG 19

2025

[50] [50]

In: European Conference on Computer Vision

Zhai, G., Örnek, E.P., Chen, D.Z., Liao, R., Di, Y., Navab, N., Tombari, F., Busam, B.: Echoscene: Indoor scene generation via information echo over scene graph diffu- sion. In: European Conference on Computer Vision. pp. 167–184. Springer (2024)

2024

[51] [51]

Advances in Neural Information Processing Systems36, 30026–30038 (2023)

Zhai, G., Örnek, E.P., Wu, S.C., Di, Y., Tombari, F., Navab, N., Busam, B.: Com- monscenes: Generating commonsense 3d indoor scenes with scene graph diffusion. Advances in Neural Information Processing Systems36, 30026–30038 (2023)

2023

[52] [52]

arXiv preprint arXiv:2510.23607 (2025)

Zhang, Y., Wu, X., Lao, Y., Wang, C., Tian, Z., Wang, N., Zhao, H.: Concerto: Joint 2d-3d self-supervised learning emerges spatial representations. arXiv preprint arXiv:2510.23607 (2025)

work page arXiv 2025

[53] [53]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16965–16974 (2024)

2024