
arxiv: 2602.12633 · v2 · pith:CBWW52PNnew · submitted 2026-02-13 · 💻 cs.RO

Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

Pith reviewed 2026-05-21 13:06 UTC · model grok-4.3

classification 💻 cs.RO
keywords real-to-sim · scene reconstruction · physics constraints · contact graph · differentiable simulation · robotic manipulation · cluttered environments · rgb-d

The pith

A contact-graph optimization with differentiable simulation reconstructs physically valid 3D scenes from single-view RGB-D observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard 3D reconstruction from one RGB-D image often leaves objects floating or overlapping, which makes robotic simulation unreliable for tasks like picking in piles. This paper introduces a pipeline that builds an explicit contact graph among objects and then jointly tunes their poses and physical properties inside a differentiable rigid-body simulator. The optimization enforces physical laws so that the resulting scene respects real contact forces and stable stacking. A reader would care because accurate physical scenes let robot planners test actions in simulation before real execution, reducing trial-and-error in cluttered settings. If the method works, it turns single-view perception into a usable bridge for contact-rich manipulation.

Core claim

By modeling inter-object spatial dependencies via a contact graph and refining object poses together with physical properties through differentiable rigid-body simulation, single-view RGB-D data can be turned into 3D scenes that exhibit high physical fidelity and accurately replicate real-world contact dynamics.
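
As a rough illustration of joint refinement under contact constraints (not the paper's implementation), the sketch below adjusts the vertical positions of stacked boxes by gradient descent on a data term plus a soft contact term. Everything here is an assumption made for illustration: the names `contact_gaps` and `refine_heights`, the 1-D box model, the quadratic penalty, and the weights. The paper uses a differentiable rigid-body simulator and also refines physical properties, which this toy omits.

```python
def contact_gaps(z, heights, support):
    """Signed gap between each box's bottom face and its supporter's top face.

    Positive gap = floating, negative gap = penetration. support[i] is the
    index of the supporting box, or -1 when the box rests on the ground (z=0).
    """
    gaps = []
    for i, parent in enumerate(support):
        top = 0.0 if parent < 0 else z[parent] + heights[parent] / 2
        gaps.append((z[i] - heights[i] / 2) - top)
    return gaps


def refine_heights(z_obs, heights, support, w_contact=10.0, lr=0.01, steps=2000):
    """Gradient descent on sum((z - z_obs)^2) + w_contact * sum(gap^2)."""
    z = list(z_obs)
    for _ in range(steps):
        gaps = contact_gaps(z, heights, support)
        # Data term: stay close to the observed (e.g. ICP-refined) estimate.
        grad = [2.0 * (zi - oi) for zi, oi in zip(z, z_obs)]
        # Contact term: each gap grows with the child and shrinks with its supporter.
        for i, parent in enumerate(support):
            grad[i] += 2.0 * w_contact * gaps[i]
            if parent >= 0:
                grad[parent] -= 2.0 * w_contact * gaps[i]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


# Hypothetical initial estimate: box 0 floats 2 cm, box 1 sinks 4 cm into box 0.
heights = [0.10, 0.10]
support = [-1, 0]
z_init = [0.07, 0.13]
z_refined = refine_heights(z_init, heights, support)
print("gaps before:", contact_gaps(z_init, heights, support))
print("gaps after: ", contact_gaps(z_refined, heights, support))
```

Because the contact term is a soft penalty traded off against the data term, the gaps shrink but do not reach exactly zero; a hard-constraint or simulation-based formulation would behave differently.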

What carries the argument

The contact graph, which encodes spatial dependencies between objects and drives joint pose and property refinement inside differentiable rigid-body simulation to enforce physical consistency.
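
To make the idea of a contact graph concrete, here is a minimal sketch of one possible data structure. The Figure 2 caption describes the graph as a supporting tree plus an edge set; the sketch mirrors that with a `support_parent` map and a `contact_edges` set. The class names, the axis-aligned box geometry, and the 5 mm tolerance are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Box:
    name: str
    lo: tuple  # (x, y, z) minimum corner, metres
    hi: tuple  # (x, y, z) maximum corner, metres


@dataclass
class ContactGraph:
    support_parent: dict = field(default_factory=dict)  # child name -> supporter name
    contact_edges: set = field(default_factory=set)     # unordered name pairs


def _overlap(a_lo, a_hi, b_lo, b_hi, tol):
    """True if two 1-D intervals overlap once each is padded by tol."""
    return a_lo <= b_hi + tol and b_lo <= a_hi + tol


def build_contact_graph(boxes, tol=0.005):
    graph = ContactGraph()
    for b in boxes:
        # A box whose bottom face sits at z = 0 is treated as resting on the ground.
        supporter = "ground" if abs(b.lo[2]) <= tol else None
        for other in boxes:
            if other is b:
                continue
            xy = all(_overlap(b.lo[k], b.hi[k], other.lo[k], other.hi[k], 0.0) for k in (0, 1))
            # Support: footprints overlap and b's bottom meets other's top.
            if xy and abs(b.lo[2] - other.hi[2]) <= tol:
                supporter = other.name
        if supporter is not None:
            graph.support_parent[b.name] = supporter
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            # Contact: padded boxes overlap along all three axes.
            if all(_overlap(a.lo[k], a.hi[k], b.lo[k], b.hi[k], tol) for k in range(3)):
                graph.contact_edges.add(frozenset((a.name, b.name)))
    return graph


scene = [
    Box("A", (0.0, 0.0, 0.0), (0.2, 0.2, 0.1)),
    Box("B", (0.05, 0.05, 0.1), (0.15, 0.15, 0.2)),  # stacked on A
    Box("C", (0.2, 0.0, 0.0), (0.3, 0.2, 0.1)),      # beside A, touching
    Box("D", (1.0, 1.0, 0.5), (1.1, 1.1, 0.6)),      # floating, no supporter
]
graph = build_contact_graph(scene)
print(graph.support_parent)
print(graph.contact_edges)
```

An object that ends up with no entry in `support_parent` (here, `D`) is a candidate floating object, which is the kind of invalid state the optimization is meant to remove.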

If this is right

  • Reconstructed scenes achieve high physical fidelity.
  • Scenes faithfully replicate real-world contact dynamics.
  • The scenes enable stable and reliable contact-rich manipulation.
  • The pipeline works across both simulated and real-world evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contact-graph approach could support online scene updates when a robot moves objects during manipulation.
  • Adding material and friction estimation to the optimization might further improve simulation-to-real transfer for learning-based policies.
  • The method connects perception directly to planning by producing scenes that can be used as starting states for physics-based planners.

Load-bearing premise

Single-view RGB-D observations contain enough information for contact-graph optimization to uniquely determine object poses and physical properties while preventing invalid states such as floating or inter-penetrating objects.

What would settle it

Reconstructed scenes that contain floating objects or inter-penetrations when loaded into a rigid-body simulator, or robot manipulation trials whose success rates differ sharply from real-world execution due to mismatched contact forces.
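
One simple way to run this kind of check is to compare object positions before and after a gravity-only rollout and flag anything that moved. The sketch below assumes the positions have already been read out of a simulator (the figures indicate the paper uses PyBullet); the function names and the 1 cm drift threshold are illustrative assumptions, not values from the paper.

```python
import math


def settle_drift(before, after):
    """Per-object translation (metres) between pre- and post-rollout positions."""
    return {name: math.dist(before[name], after[name]) for name in before}


def unstable_objects(before, after, tol=0.01):
    """Names of objects that moved more than tol metres during the rollout."""
    drift = settle_drift(before, after)
    return sorted(name for name, d in drift.items() if d > tol)


# Hypothetical positions (x, y, z) before and after letting the scene settle.
before = {"mug": (0.0, 0.0, 0.05), "box": (0.1, 0.0, 0.20)}
after = {"mug": (0.0, 0.0, 0.05), "box": (0.1, 0.0, 0.15)}
print(unstable_objects(before, after))
```

A floating object shows up as a drop, and an inter-penetrating pair typically shows up as a sudden separation when the simulator resolves the overlap; both register as drift under this check. Rotation is ignored here, so a toppling object with a fixed centre would need an orientation term.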

Figures

Figures reproduced from arXiv: 2602.12633 by Andrew F. Luo, Guoyang Zhao, Jiahang Cao, Jun Ma, Sikai Guo, Tianyi Xiang.

Figure 1: Scene-level Real2Sim methods for physical stability. Given a single RGB-D observation and instance masks, we reconstruct the scene and simulate in PyBullet [7]. (a) SAM3D [8] with Iterative Closest Point (ICP) refinement, without geometric and physical constraints, results in interpenetration and floating, leading to unstable rollouts. (b) The geometry-only constrained method avoids penetration and ensures…

Figure 2: Overview of our method. Our physics-constrained Real2Sim pipeline consists of four stages. (a) Initial Reconstruction: Given a single RGB-D image I_t and instance masks M_t, we obtain an initial estimation of objects geometry and appearance θ using SAM3D [8] and ICP pose refinement. (b) Contact Graph Construction: We construct a contact graph cg = (pt, E), where parse tree pt represents supporting tree and e…

Figure 3: Qualitative comparisons of physical simulation results with state-of-the-art scene-level reconstruction methods in the simulation environment. We visualize geometry and appearance before and after physical simulation with gravity in PyBullet [7]. Our method produces non-interpenetrating, contact-coherent geometry and achieves long-horizon physical stability compared with baseline methods.

Figure 4: Real-world Real2Sim experiment with robot pushing interaction replay. We record the pushing trajectory of a Franka arm in the real world and replay it in the reconstructed digital twin. Using a single-view observation, our method produces a physically consistent scene and better matches the predicted post-interaction than SAM3D+ICP.
Original abstract

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D observations in highly cluttered environments. It introduces a differentiable optimization framework that builds a contact graph to capture inter-object spatial dependencies and jointly refines object poses and physical properties via differentiable rigid-body simulation, with the goal of eliminating invalid states such as floating objects or inter-penetrations. The manuscript reports extensive evaluations in simulation and real-world settings claiming high physical fidelity and faithful replication of contact dynamics to support stable contact-rich robotic manipulation.

Significance. If the central claims are substantiated, the work would be significant for robotics and simulation-based planning, as it directly targets the common failure mode of physically invalid reconstructions that undermine downstream control in cluttered, contact-rich scenarios. The explicit use of inter-object contact reasoning and differentiable simulation represents a targeted advance over purely geometric perception pipelines.

major comments (2)
  1. [Method (contact-graph optimization and differentiable simulation)] The central claim that the joint pose-and-property optimization produces scenes free of floating objects or inter-penetration (and thereby enables reliable contact-rich manipulation) is load-bearing. In the method section describing the contact-graph construction and differentiable simulation penalties, the manuscript must demonstrate—via ablation on loss terms, convergence analysis from varied initializations, or explicit metrics on invalid-state rates—that residual pose ambiguities from single-view RGB-D and partial occlusions are resolved rather than merely locally consistent.
  2. [Evaluation and results] The abstract states that 'extensive evaluations in both simulation and real-world settings demonstrate' high physical fidelity and replication of contact dynamics. To support this, the results section (or associated tables/figures) must report quantitative metrics with error bars and baselines, such as mean penetration volume, floating height distributions, or downstream manipulation success rates under the reconstructed scenes versus geometric-only or non-contact-graph ablations.
minor comments (1)
  1. [Method] Notation for the contact graph and the exact form of the differentiable simulation loss could be clarified with a small diagram or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Method (contact-graph optimization and differentiable simulation)] The central claim that the joint pose-and-property optimization produces scenes free of floating objects or inter-penetration (and thereby enables reliable contact-rich manipulation) is load-bearing. In the method section describing the contact-graph construction and differentiable simulation penalties, the manuscript must demonstrate—via ablation on loss terms, convergence analysis from varied initializations, or explicit metrics on invalid-state rates—that residual pose ambiguities from single-view RGB-D and partial occlusions are resolved rather than merely locally consistent.

    Authors: We agree that additional evidence is needed to substantiate that the optimization resolves pose ambiguities rather than achieving only local consistency. In the revised version, we have expanded the method section with an ablation study on the individual loss terms, including those from the contact graph and differentiable simulation. We also provide convergence analysis from varied initializations and report explicit metrics on the rates of invalid states (such as floating objects and inter-penetrations) pre- and post-optimization. These additions demonstrate the effectiveness of the joint optimization in handling ambiguities from single-view observations. revision: yes

  2. Referee: [Evaluation and results] The abstract states that 'extensive evaluations in both simulation and real-world settings demonstrate' high physical fidelity and replication of contact dynamics. To support this, the results section (or associated tables/figures) must report quantitative metrics with error bars and baselines, such as mean penetration volume, floating height distributions, or downstream manipulation success rates under the reconstructed scenes versus geometric-only or non-contact-graph ablations.

    Authors: We appreciate this suggestion to strengthen the empirical support. We have revised the results section to include quantitative metrics with error bars, such as mean penetration volume and distributions of floating heights. Comparisons to baselines, including geometric-only reconstructions and ablations without the contact graph, are now presented. Furthermore, we report downstream manipulation success rates in the reconstructed scenes for both simulation and real-world settings to better support the claims of high physical fidelity and reliable contact-rich manipulation. revision: yes
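
The metrics named in this exchange (penetration, floating height, invalid-state rate, each with error bars) could be aggregated along the following lines. This is a sketch under stated assumptions: the function names, the tolerance values, and the per-scene numbers are placeholders for illustration only and are not results from the paper.

```python
import statistics


def summarize(values):
    """Mean, sample standard deviation, and standard error over scenes."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    sem = std / len(values) ** 0.5
    return {"mean": mean, "std": std, "sem": sem, "n": len(values)}


def invalid_rate(penetration, floating, pen_tol=1e-3, float_tol=5e-3):
    """Fraction of scenes with penetration or floating above tolerance (metres)."""
    bad = sum(1 for p, f in zip(penetration, floating) if p > pen_tol or f > float_tol)
    return bad / len(penetration)


# Placeholder per-scene measurements in metres (NOT data from the paper).
penetration_depth = [0.0, 0.002, 0.0, 0.0]
floating_height = [0.0, 0.0, 0.01, 0.0]

print(summarize(penetration_depth))
print(summarize(floating_height))
print(invalid_rate(penetration_depth, floating_height))
```

Reporting the same summary for each baseline and ablation, before and after optimization, would give the side-by-side comparison the referee asks for.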

Circularity Check

0 steps flagged

No circularity: forward optimization pipeline remains independent of its outputs.

full rationale

The paper describes a differentiable optimization pipeline that builds a contact graph from single-view RGB-D input and jointly refines poses and properties via rigid-body simulation. This constitutes a self-contained forward procedure whose validity is assessed by external simulation and real-world evaluations rather than by re-deriving the same quantities from fitted parameters or prior self-citations. No equations reduce the claimed physical consistency to a re-labeling of the input observations, and no uniqueness theorem or ansatz is imported from overlapping author work. The derivation therefore does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; no explicit fitting constants, background lemmas, or new postulated objects are described.




Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    Analysis and observations from the first amazon picking challenge,

    N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman, “Analysis and observations from the first amazon picking challenge,” IEEE Transactions on Automation Science and Engineering, vol. 15, no. 1, pp. 172–188, 2016

  2. [2]

    A framework for push-grasping in clutter,

    M. Dogar and S. Srinivasa, “A framework for push-grasping in clutter,” Robotics: Science and systems VII, vol. 1, pp. 65–72, 2011

  3. [3]

    Extraction of physically plausible support relations to predict and validate manipulation action effects,

    R. Kartmann, F. Paus, M. Grotz, and T. Asfour, “Extraction of physically plausible support relations to predict and validate manipulation action effects,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3991–3998, 2018

  4. [4]

    Holoscene: Simulation-ready interactive 3d worlds from a single video,

    H. Xia, C.-H. Lin, H.-Y. Hsu, Q. Leboutet, K. Gao, M. Paulitsch, B. Ummenhofer, and S. Wang, “Holoscene: Simulation-ready interactive 3d worlds from a single video,” arXiv preprint arXiv:2510.05560, 2025

  5. [5]

    Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation,

    M. T. Villasevil, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal, “Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation,” in Robotics: Science and Systems, 2024

  6. [6]

    Closing the sim-to-real loop: Adapting simulation randomization with real world experience,

    Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simulation randomization with real world experience,” in 2019 International Conference on Robotics and Automation, 2019, pp. 8973–8979

  7. [7]

    Pybullet, a python module for physics simulation for games, robotics and machine learning,

    E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016

  8. [8]

    SAM 3D: 3Dfy Anything in Images

    X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li et al., “Sam 3d: 3dfy anything in images,” arXiv preprint arXiv:2511.16624, 2025

  9. [9]

    Phyrecon: Physically plausible neural scene reconstruction,

    J. Ni, Y. Chen, B. Jing, N. Jiang, B. Wang, B. Dai, P. Li, Y. Zhu, S.-C. Zhu, and S. Huang, “Phyrecon: Physically plausible neural scene reconstruction,” Advances in Neural Information Processing Systems, vol. 37, pp. 25747–25780, 2024

  10. [10]

    Physically compatible 3d object modeling from a single image,

    M. Guo, B. Wang, P. Ma, T. Zhang, C. Owens, C. Gan, J. Tenenbaum, K. He, and W. Matusik, “Physically compatible 3d object modeling from a single image,” Advances in Neural Information Processing Systems, vol. 37, pp. 119260–119282, 2024

  11. [11]

    Cast: Component-aligned 3d scene reconstruction from an rgb image,

    K. Yao, L. Zhang, X. Yan, Y. Zeng, Q. Zhang, L. Xu, W. Yang, J. Gu, and J. Yu, “Cast: Component-aligned 3d scene reconstruction from an rgb image,” ACM Transactions on Graphics, vol. 44, no. 4, pp. 1–19, 2025

  12. [12]

    Physpose: Refining 6d object poses with physical constraints,

    M. Malenický, M. Cífka, M. Fourmy, L. Montaut, J. Carpentier, J. Sivic, and V. Petrik, “Physpose: Refining 6d object poses with physical constraints,” arXiv preprint arXiv:2503.23587, 2025

  13. [13]

    Reconstructing interactive 3d scenes by panoptic mapping and cad model alignments,

    M. Han, Z. Zhang, Z. Jiao, X. Xie, Y. Zhu, S.-C. Zhu, and H. Liu, “Reconstructing interactive 3d scenes by panoptic mapping and cad model alignments,” in 2021 International Conference on Robotics and Automation, 2021, pp. 12199–12206

  14. [14]

    Inferring 3d shapes of unknown rigid objects in clutter through inverse physics reasoning,

    C. Song and A. Boularias, “Inferring 3d shapes of unknown rigid objects in clutter through inverse physics reasoning,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 201–208, 2018

  15. [15]

    Brax–a differentiable physics engine for large scale rigid body simulation,

    C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax–a differentiable physics engine for large scale rigid body simulation,” arXiv preprint arXiv:2106.13281, 2021

  16. [16]

    Diffsdfsim: Differentiable rigid-body dynamics with implicit shapes,

    M. Strecke and J. Stueckler, “Diffsdfsim: Differentiable rigid-body dynamics with implicit shapes,” in 2021 International Conference on 3D Vision, 2021, pp. 96–105

  17. [17]

    One-shot real-to-sim via end-to-end differentiable simulation and rendering,

    Y. Zhu, T. Xiang, A. M. Dollar, and Z. Pan, “One-shot real-to-sim via end-to-end differentiable simulation and rendering,” IEEE Robotics and Automation Letters, 2025

  18. [18]

    Acdc: Automated creation of digital cousins for robust policy learning,

    T. Dai, J. Wong, Y. Jiang, C. Wang, C. Gokmen, R. Zhang, J. Wu, and L. Fei-Fei, “Acdc: Automated creation of digital cousins for robust policy learning,” arXiv e-prints, pp. arXiv–2410, 2024

  19. [19]

    Urdformer: A pipeline for constructing articulated simulation environments from real-world images

    Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta, “Urdformer: A pipeline for constructing articulated simulation environments from real-world images,” arXiv preprint arXiv:2405.11656, 2024

  20. [20]

    Physically embodied gaussian splatting: A visually learnt and physically grounded 3d representation for robotics,

    J. Abou-Chakra, K. Rana, F. Dayoub, and N. Suenderhauf, “Physically embodied gaussian splatting: A visually learnt and physically grounded 3d representation for robotics,” in 8th Annual Conference on Robot Learning, 2024

  21. [21]

    Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin for real-world robot policy evaluation,

    J. Abou-Chakra, L. Sun, K. Rana, B. May, K. Schmeckpeper, M. V. Minniti, and L. Herlant, “Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin for real-world robot policy evaluation,” arXiv preprint arXiv:2504.03597, 2025

  22. [22]

    Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos,

    H. Jiang, H.-Y. Hsu, K. Zhang, H.-N. Yu, S. Wang, and Y. Li, “Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos,” arXiv preprint arXiv:2503.17973, 2025

  23. [23]

    Synctwin: Fast digital twin construction and synchronization for safe robotic grasping,

    R. Huang, B. Yang, W. Gui, J. Morgan, E. Biyik, and J. Li, “Synctwin: Fast digital twin construction and synchronization for safe robotic grasping,” arXiv preprint arXiv:2601.09920, 2026

  24. [24]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,” arXiv preprint arXiv:2406.02523, 2024

  25. [25]

    Open3D: A Modern Library for 3D Data Processing

    Q.-Y. Zhou, J. Park, and V. Koltun, “Open3D: A modern library for 3D data processing,” arXiv:1801.09847, 2018

  26. [26]

    On the stability properties of quadruped creeping gaits,

    R. B. McGhee and A. A. Frank, “On the stability properties of quadruped creeping gaits,” Mathematical Biosciences, vol. 3, pp. 331–351, 1968

  27. [27]

    Kaolin: A pytorch library for accelerating 3D deep learning research,

    K. M. Jatavallabhula, E. Smith, J.-F. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A pytorch library for accelerating 3D deep learning research,” arXiv preprint arXiv:1911.05063, 2019

  28. [28]

    Local optimization for robust signed distance field collision,

    M. Macklin, K. Erleben, M. Müller, N. Chentanez, S. Jeschke, and Z. Corse, “Local optimization for robust signed distance field collision,” Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 3, no. 1, pp. 1–17, 2020

  29. [29]

    Learning to predict 3d objects with an interpolation-based differentiable renderer,

    W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler, “Learning to predict 3d objects with an interpolation-based differentiable renderer,” Advances in Neural Information Processing Systems, vol. 32, 2019

  30. [30]

    Google scanned objects: A high-quality dataset of 3d scanned household items,

    L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V. Vanhoucke, “Google scanned objects: A high-quality dataset of 3d scanned household items,” in 2022 International Conference on Robotics and Automation, 2022, pp. 2553–2560

  31. [31]

    Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set,

    B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set,” IEEE Robotics & Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015

  32. [32]

    Using shape to categorize: Low-shot learning with an explicit shape bias,

    S. Stojanov, A. Thai, and J. M. Rehg, “Using shape to categorize: Low-shot learning with an explicit shape bias,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1798–1808