PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

Aaron Thomas; Haoyu Li; Shuyan Zhou; Xianyi Cheng

arxiv: 2606.08655 · v1 · pith:MYWWJ5BBnew · submitted 2026-06-07 · 💻 cs.RO · cs.CV

Pith reviewed 2026-06-27 18:12 UTC · model grok-4.3

keywords: 3D scene graph · physics-aware perception · RGB-D reconstruction · articulation prediction · mass estimation · semantic segmentation · robot reasoning · kinematic modeling

The pith

PhysGraph builds 3D scene graphs that include physical properties like mass and articulation from RGB-D observations in cluttered scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PhysGraph aims to equip robots with a 3D representation that captures both semantic labels and physical behaviors such as material properties and joint movements. It takes RGB-D input, builds consistent object models across views, breaks objects into parts, and uses visual cues to estimate materials and articulations without restricting to single objects or narrow datasets. Current approaches either skip physical factors or fail to generalize, so this unified method targets scalability for everyday tasks. If it holds, robots could use the resulting graphs for planning that respects physical constraints.

Core claim

PhysGraph unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, it reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction, and supports downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer.
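The cross-view association step can be pictured with a small sketch. The abstract does not say how PhysGraph matches instances, so everything here is an assumption made for illustration: the function names `cosine` and `associate`, the greedy matching strategy, and the `threshold` value are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(tracks, detections, threshold=0.8):
    """Greedily match detections from a new view to existing object tracks.

    tracks: dict mapping track id -> feature vector (hypothetical layout).
    detections: list of feature vectors observed in the new view.
    Returns one track id per detection; unmatched detections open new tracks.
    """
    assigned = []
    used = set()  # a track can absorb at most one detection per view
    next_id = max(tracks, default=-1) + 1
    for feat in detections:
        best_id, best_sim = None, threshold
        for tid, tfeat in tracks.items():
            if tid in used:
                continue
            sim = cosine(feat, tfeat)
            if sim >= best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:
            # Nothing similar enough: treat it as a previously unseen object.
            best_id = next_id
            next_id += 1
            tracks[best_id] = list(feat)
        used.add(best_id)
        assigned.append(best_id)
    return assigned
```

A real system would likely combine appearance features with 3D overlap; the sketch only shows why consistent ids across views are a matching problem.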

What carries the argument

The PhysGraph pipeline of object-centric 3D reconstruction from RGB-D, cross-view instance association, functional part decomposition, and visual inference of materials and articulations to form structured scene graphs.
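To make the output of that pipeline concrete, here is a minimal sketch of what a physics-aware scene graph could look like as a data structure. The class and field names (`SceneGraph`, `ObjectNode`, `Part`, `Articulation`) are hypothetical, and computing mass as density times volume is an assumption of this sketch, not a statement of the paper's method.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Articulation:
    joint_type: str                     # e.g. "revolute" or "prismatic"
    axis: Tuple[float, float, float]    # joint axis direction
    origin: Tuple[float, float, float]  # a point on the joint axis


@dataclass
class Part:
    name: str
    material: str
    density_kg_m3: float  # assumed to come from the inferred material
    volume_m3: float      # assumed to come from the reconstructed geometry
    articulation: Optional[Articulation] = None

    @property
    def mass_kg(self) -> float:
        # Assumption: mass = density * volume.
        return self.density_kg_m3 * self.volume_m3


@dataclass
class ObjectNode:
    label: str
    parts: List[Part] = field(default_factory=list)

    @property
    def mass_kg(self) -> float:
        return sum(p.mass_kg for p in self.parts)


@dataclass
class SceneGraph:
    objects: List[ObjectNode] = field(default_factory=list)

    def articulated_parts(self):
        """Return (object label, part name) pairs that carry a joint."""
        return [(o.label, p.name)
                for o in self.objects
                for p in o.parts
                if p.articulation is not None]
```

Keeping physical attributes on parts rather than whole objects mirrors the paper's part-level decomposition: a cabinet door and its body can then carry different materials and joints.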

If this is right

The scene graphs support constraint-aware 3D affordance prediction for task planning.
They enable real-to-sim transfer as demonstrated in the experiments.
A single model handles semantic segmentation, multi-object mass estimation, and articulation prediction together.
Object instances remain consistent across multiple views while physical properties are attached.
The representation scales to cluttered scenes with varied object types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structured output could initialize simulations with inferred physical parameters rather than manual setup.
Downstream planners that use the graphs might avoid actions that violate inferred joint limits or material constraints.
Extending the visual inference step to video input could capture time-varying articulations without additional sensors.
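The first extension above, initializing a simulation from inferred parameters, can be sketched by emitting a MuJoCo MJCF fragment for a single part. MuJoCo is the simulator named in Figure 5, but the helper `part_to_mjcf`, its parameters, and the box-shaped geometry are assumptions of this sketch; the paper's actual export uses watertight meshes.

```python
import xml.etree.ElementTree as ET


def _vec(values):
    """Format a numeric tuple as a space-separated MJCF attribute string."""
    return " ".join(f"{v:g}" for v in values)


def part_to_mjcf(name, half_extents, mass_kg,
                 hinge_axis=None, hinge_pos=(0.0, 0.0, 0.0)):
    """Build a minimal MJCF model for one box-shaped part (hypothetical helper).

    half_extents: box half-sizes in metres, as MJCF expects for type="box".
    mass_kg: inferred mass, written to the geom's mass attribute.
    hinge_axis: if given, a hinge (revolute) joint is added about this axis.
    """
    root = ET.Element("mujoco", model=name)
    world = ET.SubElement(root, "worldbody")
    body = ET.SubElement(world, "body", name=name)
    if hinge_axis is not None:
        ET.SubElement(body, "joint", name=f"{name}_hinge", type="hinge",
                      axis=_vec(hinge_axis), pos=_vec(hinge_pos))
    ET.SubElement(body, "geom", type="box",
                  size=_vec(half_extents), mass=str(mass_kg))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(part_to_mjcf("door", (0.3, 0.01, 0.4), 7.0, hinge_axis=(0, 0, 1)))
```

The point of the sketch is that each inferred quantity (mass, joint type, axis, axis position) maps onto a simulator attribute that would otherwise be set by hand.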

Load-bearing premise

Visual reasoning from RGB-D observations alone can accurately infer materials and articulations for diverse object types in cluttered scenes without narrow training sets or single-object modeling.

What would settle it

A test set of real cluttered scenes containing objects with matching appearances but different densities or joint types where the system's mass estimates or articulation predictions show large errors against ground-truth measurements.
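One way to score such a test is sketched below. Both metrics are assumptions chosen for illustration, since the abstract does not state which metrics the paper reports: relative error for mass, and the angle between predicted and ground-truth joint axes for articulation.

```python
import math


def relative_mass_error(pred_kg, true_kg):
    """Absolute mass error as a fraction of the ground-truth mass."""
    if true_kg <= 0:
        raise ValueError("ground-truth mass must be positive")
    return abs(pred_kg - true_kg) / true_kg


def axis_angle_error_deg(pred_axis, true_axis):
    """Angle in degrees between predicted and ground-truth joint axes.

    The sign of the axis is kept, so a flipped axis counts as a 180 degree
    error. That choice fits a right-hand-rule convention for motion direction.
    """
    dot = sum(p * t for p, t in zip(pred_axis, true_axis))
    norm_p = math.sqrt(sum(p * p for p in pred_axis))
    norm_t = math.sqrt(sum(t * t for t in true_axis))
    if norm_p == 0 or norm_t == 0:
        raise ValueError("axes must be non-zero vectors")
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))
```

On look-alike objects with different densities, a system that reasons from appearance alone would be expected to show a large relative mass error on at least one object of each pair.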

Figures

Figures reproduced from arXiv: 2606.08655 by Aaron Thomas, Haoyu Li, Shuyan Zhou, Xianyi Cheng.

**Figure 2.** Overview of PhysGraph. Given RGB-D observations and camera poses, PhysGraph first performs object-centric perception to reconstruct and segment objects, extract object-level features, and associate instances across views. Selected key frames are then used for visual reasoning to estimate part-level physical properties, including articulation and material attributes. The resulting outputs are integrated int…

**Figure 3.** Object Reconstruction Qualitative Results. We present three reconstructed scenes from both a synthetic dataset (Replica) and real-world sequences (SceneFun3D). The first row shows the recognized object-level 3D reconstructions. The second row visualizes the corresponding object feature embeddings projected via PCA, where similar objects exhibit similar colors.

**Figure 4.** Articulation Results. Compared to 3DOI [30], PhysGraph generalizes across diverse cabinet, door, and toilet geometries, accurately identifying joint types and motion directions even in cluttered or unseen configurations, while baselines frequently fail or mispredict articulation. The visualized articulation axes follow the right-hand rule.

**Figure 5.** Reconstruction to Simulation: the room 0 scene from the Replica dataset. From left to right: 3DGS reconstruction, watertight mesh with movable revolute joint locations and directions encoded in the red arrows, and the resulting simulation environment in MuJoCo.

**Figure 6.** PhysGraph for manipulation. A robot executes manipulation tasks guided by the structured scene graph. Left: the robot localizes the oven handle and infers the correct articulation axis to open the door. Right: the robot reasons over functional relationships to interact with the environment.

Original abstract

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhysGraph adds physical properties to scene graphs built from RGB-D, but the abstract gives no supporting details for its SOTA claims.

PhysGraph is an attempt to build 3D scene graphs that include physical and kinematic properties from RGB-D input. The main takeaway is that while the high-level pipeline makes sense, the SOTA performance claims rest on details not visible in the abstract.

What is new is the unified framework that reconstructs object-centric geometry, decomposes objects into parts, and applies visual reasoning to infer materials, articulations, and mass in cluttered scenes. This addresses limitations in both semantic-only scene graphs and narrow physical modeling approaches.

The paper does well in connecting the representation to practical robotics uses, such as affordance prediction with constraints and real-to-sim transfer. The design is described as simple yet effective, which is a reasonable goal.

Where it gets soft is in the evaluation. It asserts state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction on synthetic and real datasets, but without any mention of specific baselines, metrics, or how the datasets were constructed, those claims are difficult to verify. The visual reasoning component is key, and its reliability across diverse object types in clutter is the critical assumption that needs strong support.

No obvious internal contradictions appear in the described approach. The steps follow logically from reconstruction to reasoning to downstream tasks.

This work is for researchers in robotics and 3D vision interested in physically grounded scene representations. Readers working on similar graph-based or part-based models might find the ideas worth considering, though they would need the full paper to assess the implementation.

I think it deserves peer review so that the methods and quantitative results can be properly examined.

Referee Report

1 major / 0 minor

Summary. The paper introduces PhysGraph, a framework that builds a physics-aware 3D scene graph from RGB-D observations. It reconstructs object-centric geometry, associates instances across views, decomposes objects into functional parts, and uses visual reasoning to infer materials, articulations, and mass. The central claims are state-of-the-art performance in semantic segmentation, multi-object mass estimation, and articulation prediction on synthetic and real datasets, plus demonstrations of downstream uses in constraint-aware affordance prediction and real-to-sim transfer.

Significance. If the quantitative claims hold with proper baselines and error bars, the work would offer a scalable, object-centric representation that integrates semantic structure with physical properties, addressing limitations of narrow training sets or single-object models in robotics. The design choices (object-centric reconstruction plus part decomposition) align with standard practices, and the downstream applications are consistent with the representation. No machine-checked proofs or parameter-free derivations are present; the contribution is primarily empirical and engineering-oriented.

major comments (1)

[Abstract] The claims of state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction are presented without any description of methods, baselines, quantitative metrics, error bars, or dataset details. This absence makes the central empirical claims unverifiable from the provided text and is load-bearing for the paper's primary contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below regarding the abstract.

Referee: The claims of state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction are presented without any description of methods, baselines, quantitative metrics, error bars, or dataset details. This absence makes the central empirical claims unverifiable from the provided text and is load-bearing for the paper's primary contribution.

Authors: We acknowledge that the abstract is high-level and omits specific methodological details, baselines, metrics, error bars, and dataset information, as is conventional for abstracts due to strict length constraints. The full manuscript substantiates all claims with complete descriptions: methods in Section 3, baselines/quantitative metrics/error bars in Section 4 (including Tables 1-3 and Figures 4-6), and dataset details in Section 4.1. The claims are therefore verifiable from the manuscript body. We disagree that the abstract itself must contain these elements to support the contribution, but we can partially revise the abstract to include one additional sentence referencing key metrics (e.g., mIoU improvements) if the editor permits.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity

The provided abstract and high-level description contain no equations, derivations, fitted parameters, or predictions that could reduce to inputs by construction. The framework is described as a pipeline of reconstruction, part decomposition, and visual reasoning from RGB-D data, with SOTA claims resting on empirical evaluation rather than any self-referential mathematical step. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way within the given text. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; assessment limited to surface claims.

pith-pipeline@v0.9.1-grok · 5751 in / 1113 out tokens · 27714 ms · 2026-06-27T18:12:14.600335+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 16 canonical work pages · 4 internal anchors

[1] M. Z. Irshad, M. Comi, Y.-C. Lin, N. Heppert, A. Valada, R. Ambrus, Z. Kira, and J. Tremblay, “Neural fields in robotics: A survey,” arXiv:2410.20220, 2024.
[2] Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa et al., “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” ICRA, 2024.
[3] K. M. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, T. Chen, S. Li, G. Iyer, S. Saryazdi, N. Keetha, A. Tewari, J. B. Tenenbaum, C. M. de Melo, M. Krishna, L. Paull, F. Shkurti, and A. Torralba, “Conceptfusion: Open-set multimodal 3d mapping,” RSS, 2023.
[4] C. Zhang and G. H. Lee, “Iaao: Interactive affordance learning for articulated objects in 3d environments,” CVPR, 2025.
[5] A. J. Zhai, Y. Shen, E. Y. Chen, G. X. Wang, X. Wang, S. Wang, K. Guan, and S. Wang, “Physical property understanding from language-embedded feature fields,” CVPR, 2024.
[6] Y. Shuai, R. Yu, Y. Chen, Z. Jiang, X. Song, N. Wang, J. Zheng, J. Ma, M. Yang, Z. Wang et al., “Pugs: Zero-shot physical understanding with gaussian splatting,” arXiv:2502.12231, 2025.
[7] C. Zhang, D. Han, S. Zheng et al., “Mobilesamv2: Faster segment anything to everything,” arXiv:2312.09579, 2023.
[8] O. Siméoni, H. V. Vo, M. Seitzer, et al., “DINOv3,” arXiv:2508.10104, 2025.
[9] OpenAI, “Gpt-5: System card,” Aug. 2025.
[10] S. Koch, P. Hermosilla, N. Vaskevicius, M. Colosi, and T. Ropinski, “Lang3dsg: Language-based contrastive pre-training for 3d scene graph prediction,” 3DV, 2024.
[11] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,” ICCV, 2019.
[12] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” arXiv:2002.06289, 2020.
[13] Q. Li, K. Mo, Y. Yang, H. Zhao, and L. Guibas, “Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes,” arXiv:2112.05298, 2021.
[14] S. Koch, N. Vaskevicius, M. Colosi, P. Hermosilla, and T. Ropinski, “Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships,” CVPR, 2024.
[15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” ICML, 2021.
[16] C. Zhang, A. Delitzas, F. Wang, R. Zhang, X. Ji, M. Pollefeys, and F. Engelmann, “Open-vocabulary functional 3d scene graphs for real-world indoor spaces,” CVPR, 2025.
[17] L. Zhu, A. Mousavian, Y. Xiang, H. Mazhar, J. van Eenbergen, S. Debnath, and D. Fox, “Rgb-d local implicit function for depth completion of transparent objects,” CVPR, 2021.
[18] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, “Clip-fields: Weakly supervised semantic fields for robotic memory,” arXiv:2210.05663, 2022.
[19] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” arXiv:2308.07931, 2023.
[20] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” ICCV, 2023.
[21] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, 2021.
[22] M. Hassanin, S. Khan, and M. Tahtali, “Visual affordance and function understanding: A survey,” ACM Computing Surveys, 2021.
[23] Y. Tang, W. Huang, Y. Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei, “Uad: Unsupervised affordance distillation for generalization in robotic manipulation,” arXiv:2506.09284, 2025.
[24] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” arXiv:2409.01652, 2024.
[25] P. Mandikal and K. Grauman, “Learning dexterous grasping with object-centric visual affordances,” ICRA, 2021.
[26] Q. Yu, X. Yuan, J. Chen et al., “Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects,” arXiv:2507.02600, 2025.
[27] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, 2023.
[28] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” CVPR, 2024.
[29] D. Meister, S. Ogaki, C. Benthin, M. J. Doyle, M. Guthe, and J. Bittner, “A survey on bounding volume hierarchies for ray tracing,” Computer Graphics Forum, 2021.
[30] S. Qian and D. F. Fouhey, “Understanding 3d object interaction from a single image,” ICCV, 2023.
[31] J. Straub, T. Whelan, L. Ma et al., “The replica dataset: A digital replica of indoor spaces,” arXiv:1906.05797, 2019.
[32] Y. Wu, J. Meng, H. Li, C. Wu, Y. Shi, X. Cheng, C. Zhao, H. Feng, E. Ding, J. Wang, and J. Zhang, “Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding,” NeurIPS, 2024.
[33] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” arXiv:2312.16084, 2023.
[34] M. Ji, R.-Z. Qiu, X. Zou, and X. Wang, “Graspsplats: Efficient manipulation with 3d feature splatting,” arXiv:2409.02084, 2024.
[35] Y. Deng, Y. Yue, J. Dou, J. Zhao, J. Wang, Y. Tang, Y. Yang, and M. Fu, “Omnimap: A general mapping framework integrating optics, geometry, and semantics,” IEEE Transactions on Robotics, 2025.
[36] K. Yamazaki, T. Hanyu, K. Vo, T. Pham, M. Tran, G. Doretto, A. Nguyen, and N. Le, “Open-fusion: Real-time open-vocabulary 3d mapping and queryable scene representation,” ICRA, 2024.
[37] Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta, “Urdformer: A pipeline for constructing articulated simulation environments from real-world images,” arXiv:2405.11656, 2024.
[38] H. Xia, E. Su, M. Memmel et al., “Drawer: Digital reconstruction and articulation with environment realism,” CVPR, 2025.
[39] G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt, “Openclip,” Zenodo, 2021.
[40] C. Li, R. Zhang, J. Wong et al., “Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation,” arXiv:2403.09227, 2024.
[41] T. Standley, O. Sener, D. Chen, and S. Savarese, “image2mass: Estimating the mass of an object from its image,” CoRL, 2017.
[42] D. Rotondi, F. Scaparro, H. Blum, and K. O. Arras, “Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction,” IROS, 2025.
[43] A. Delitzas, A. Takmaz, F. Tombari, R. Sumner, M. Pollefeys, and F. Engelmann, “Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes,” CVPR, 2024.
[44] Y. Mao, Y. Zhang, H. Jiang, A. Chang, and M. Savva, “Multiscan: Scalable rgbd scanning for 3d environments with articulated objects,” NeurIPS, 2022.
[45] J. Corsetti, F. Giuliari, A. Fasoli, D. Boscaini, and F. Poiesi, “Functionality understanding and segmentation in 3d scenes,” CVPR, 2025.
[46] A. Takmaz, A. Delitzas, R. W. Sumner, F. Engelmann, J. Wald, and F. Tombari, “Search3d: Hierarchical open-vocabulary 3d segmentation,” RA-L, 2025.
[47] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” IROS, 2012.

[1] [1]

Neural fields in robotics: A survey,

M. Z. Irshad, M. Comi, Y .-C. Lin, N. Heppert, A. Valada, R. Ambrus, Z. Kira, and J. Tremblay, “Neural fields in robotics: A survey,” arXiv:2410.20220, 2024

work page arXiv 2024

[2] [2]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,

Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappaet al., “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,”ICRA, 2024

2024

[3] [3]

Conceptfusion: Open-set multimodal 3d mapping,

K. M. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, T. Chen, S. Li, G. Iyer, S. Saryazdi, N. Keetha, A. Tewari, J. B. Tenenbaum, C. M. de Melo, M. Krishna, L. Paull, F. Shkurti, and A. Torralba, “Conceptfusion: Open-set multimodal 3d mapping,”RSS, 2023

2023

[4] [4]

Iaao: Interactive affordance learning for articulated objects in 3d environments,

C. Zhang and G. H. Lee, “Iaao: Interactive affordance learning for articulated objects in 3d environments,”CVPR, 2025

2025

[5] [5]

Physical property understanding from language-embedded feature fields,

A. J. Zhai, Y . Shen, E. Y . Chen, G. X. Wang, X. Wang, S. Wang, K. Guan, and S. Wang, “Physical property understanding from language-embedded feature fields,”CVPR, 2024

2024

[6] [6]

Pugs: Zero-shot physical understanding with gaussian splatting,

Y . Shuai, R. Yu, Y . Chen, Z. Jiang, X. Song, N. Wang, J. Zheng, J. Ma, M. Yang, Z. Wanget al., “Pugs: Zero-shot physical understanding with gaussian splatting,”arXiv:2502.12231, 2025

work page arXiv 2025

[7] [7]

Mobilesamv2: Faster segment anything to everything,

C. Zhang, D. Han, S. Zhenget al., “Mobilesamv2: Faster segment anything to everything,”arXiv:2312.09579, 2023

work page arXiv 2023

[8] [8]

DINOv3

O. Sim ´eoni, H. V . V o, M. Seitzer,et al., “DINOv3,”arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

(2025, aug) Gpt-5: System card

OpenAI. (2025, aug) Gpt-5: System card

2025

[10] [10]

Lang3dsg: Language-based contrastive pre-training for 3d scene graph prediction,

S. Koch, P. Hermosilla, N. Vaskevicius, M. Colosi, and T. Ropinski, “Lang3dsg: Language-based contrastive pre-training for 3d scene graph prediction,”3DV, 2024

2024

[11] [11]

3d scene graph: A structure for unified semantics, 3d space, and camera,

I. Armeni, Z.-Y . He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,”ICCV, 2019

2019

[12] [12]

arXiv preprint arXiv:2002.06289 (2020)

A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans,”arXiv:2002.06289, 2020

work page arXiv 2002

[13] [13]

Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes,

Q. Li, K. Mo, Y . Yang, H. Zhao, and L. Guibas, “Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes,” arXiv:2112.05298, 2021

work page arXiv 2021

[14] [14]

Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships,

S. Koch, N. Vaskevicius, M. Colosi, P. Hermosilla, and T. Ropinski, “Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships,”CVPR, 2024

2024

[15] [15]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,”ICML, 2021

2021

[16] [16]

Open-vocabulary functional 3d scene graphs for real- world indoor spaces,

C. Zhang, A. Delitzas, F. Wang, R. Zhang, X. Ji, M. Pollefeys, and F. Engelmann, “Open-vocabulary functional 3d scene graphs for real- world indoor spaces,”CVPR, 2025

2025

[17] [17]

Rgb-d local implicit function for depth completion of transparent objects,

L. Zhu, A. Mousavian, Y . Xiang, H. Mazhar, J. van Eenbergen, S. Debnath, and D. Fox, “Rgb-d local implicit function for depth completion of transparent objects,”CVPR, 2021

2021

[18] [18]

Clip-fields: Weakly supervised semantic fields for robotic memory,

N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, “Clip-fields: Weakly supervised semantic fields for robotic memory,” arXiv:2210.05663, 2022

work page arXiv 2022

[19] [19]

Distilled feature fields enable few-shot language-guided manipula- tion,

W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipula- tion,”arXiv:2308.07931, 2023

work page arXiv 2023

[20] [20]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,”ICCV, 2023

2023

[21] [21]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor- thi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Communications of the ACM, 2021

2021

[22] [22]

Visual affordance and function understanding: A survey,

M. Hassanin, S. Khan, and M. Tahtali, “Visual affordance and function understanding: A survey,”ACM Computing Surveys, 2021

2021

[23] [23]

Uad: Unsupervised affordance distillation for generaliza- tion in robotic manipulation,

Y . Tang, W. Huang, Y . Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei, “Uad: Unsupervised affordance distillation for generaliza- tion in robotic manipulation,”arXiv:2506.09284, 2025

work page arXiv 2025

[24] [24]

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,”arXiv:2409.01652, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Learning dexterous grasping with object-centric visual affordances,

P. Mandikal and K. Grauman, “Learning dexterous grasping with object-centric visual affordances,”ICRA, 2021

2021

[26] [26]

Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects,

Q. Yu, X. Yuan, J. Chenet al., “Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects,”arXiv:2507.02600, 2025

work page arXiv 2025

[27] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, 2023

[28] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” CVPR, 2024

[29] D. Meister, S. Ogaki, C. Benthin, M. J. Doyle, M. Guthe, and J. Bittner, “A survey on bounding volume hierarchies for ray tracing,” Computer Graphics Forum, 2021

[30] S. Qian and D. F. Fouhey, “Understanding 3d object interaction from a single image,” ICCV, 2023

[31] J. Straub, T. Whelan, L. Ma et al., “The Replica dataset: A digital replica of indoor spaces,” arXiv:1906.05797, 2019

[32] Y. Wu, J. Meng, H. Li, C. Wu, Y. Shi, X. Cheng, C. Zhao, H. Feng, E. Ding, J. Wang, and J. Zhang, “Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding,” NeurIPS, 2024

[33] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” arXiv:2312.16084, 2023

[34] M. Ji, R.-Z. Qiu, X. Zou, and X. Wang, “Graspsplats: Efficient manipulation with 3d feature splatting,” arXiv:2409.02084, 2024

[35] Y. Deng, Y. Yue, J. Dou, J. Zhao, J. Wang, Y. Tang, Y. Yang, and M. Fu, “Omnimap: A general mapping framework integrating optics, geometry, and semantics,” IEEE Transactions on Robotics, 2025

[36] K. Yamazaki, T. Hanyu, K. Vo, T. Pham, M. Tran, G. Doretto, A. Nguyen, and N. Le, “Open-fusion: Real-time open-vocabulary 3d mapping and queryable scene representation,” ICRA, 2024

[37] Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta, “Urdformer: A pipeline for constructing articulated simulation environments from real-world images,” arXiv:2405.11656, 2024

[38] H. Xia, E. Su, M. Memmel et al., “Drawer: Digital reconstruction and articulation with environment realism,” CVPR, 2025

[39] G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt, “Openclip,” Zenodo, 2021

[40] C. Li, R. Zhang, J. Wong et al., “BEHAVIOR-1K: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation,” arXiv:2403.09227, 2024

[41] T. Standley, O. Sener, D. Chen, and S. Savarese, “image2mass: Estimating the mass of an object from its image,” CoRL, 2017

[42] D. Rotondi, F. Scaparro, H. Blum, and K. O. Arras, “Fungraph: Functionality aware 3d scene graphs for language-prompted scene interaction,” IROS, 2025

[43] A. Delitzas, A. Takmaz, F. Tombari, R. Sumner, M. Pollefeys, and F. Engelmann, “Scenefun3d: Fine-grained functionality and affordance understanding in 3d scenes,” CVPR, 2024

[44] Y. Mao, Y. Zhang, H. Jiang, A. Chang, and M. Savva, “Multiscan: Scalable rgbd scanning for 3d environments with articulated objects,” NeurIPS, 2022

[45] J. Corsetti, F. Giuliari, A. Fasoli, D. Boscaini, and F. Poiesi, “Functionality understanding and segmentation in 3d scenes,” CVPR, 2025

[46] A. Takmaz, A. Delitzas, R. W. Sumner, F. Engelmann, J. Wald, and F. Tombari, “Search3d: Hierarchical open-vocabulary 3d segmentation,” RA-L, 2025

[47] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” IROS, 2012