Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

Andrew F. Luo; Guoyang Zhao; Jiahang Cao; Jun Ma; Sikai Guo; Tianyi Xiang

arxiv: 2602.12633 · v2 · pith:CBWW52PNnew · submitted 2026-02-13 · 💻 cs.RO

Tianyi Xiang, Jiahang Cao, Sikai Guo, Guoyang Zhao, Andrew F. Luo, Jun Ma

Pith reviewed 2026-05-21 13:06 UTC · model grok-4.3

classification 💻 cs.RO

keywords real-to-sim · scene reconstruction · physics constraints · contact graph · differentiable simulation · robotic manipulation · cluttered environments · RGB-D

The pith

A contact-graph optimization with differentiable simulation reconstructs physically valid 3D scenes from single-view RGB-D observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard 3D reconstruction from one RGB-D image often leaves objects floating or overlapping, which makes robotic simulation unreliable for tasks like picking in piles. This paper introduces a pipeline that builds an explicit contact graph among objects and then jointly tunes their poses and physical properties inside a differentiable rigid-body simulator. The optimization enforces physical laws so that the resulting scene respects real contact forces and stable stacking. A reader would care because accurate physical scenes let robot planners test actions in simulation before real execution, reducing trial-and-error in cluttered settings. If the method works, it turns single-view perception into a usable bridge for contact-rich manipulation.

Core claim

By modeling inter-object spatial dependencies via a contact graph and refining object poses together with physical properties through differentiable rigid-body simulation, single-view RGB-D data can be turned into 3D scenes that exhibit high physical fidelity and accurately replicate real-world contact dynamics.
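
To make the idea of refining poses jointly under contact constraints concrete, here is a minimal sketch under strong simplifying assumptions. It is not the paper's differentiable rigid-body simulation: it swaps simulation for a hand-differentiated quadratic penalty on a 1-D stack of boxes, and every name in it (`contact_gaps`, `refine_stack`, `w_obs`) is hypothetical.

```python
# Toy 1-D stand-in for joint pose refinement. NOT the paper's simulator:
# boxes are stacked along z, each with a half-height and a noisy observed center.

def contact_gaps(z, half_heights):
    """Signed gap between each box's bottom and the top of what supports it.

    Box 0 rests on the ground (z = 0); box i rests on box i - 1.
    gap < 0 means penetration, gap > 0 means floating.
    """
    gaps = []
    for i in range(len(z)):
        top_below = 0.0 if i == 0 else z[i - 1] + half_heights[i - 1]
        gaps.append((z[i] - half_heights[i]) - top_below)
    return gaps


def refine_stack(z_obs, half_heights, w_obs=0.05, lr=0.1, steps=5000):
    """Gradient descent on  w_obs * sum((z - z_obs)^2) + sum(gap^2)."""
    z = list(z_obs)  # start from the observed (possibly invalid) centers
    for _ in range(steps):
        gaps = contact_gaps(z, half_heights)
        # Gradient of the observation term.
        grad = [2.0 * w_obs * (zi - oi) for zi, oi in zip(z, z_obs)]
        # Gradient of the contact term: gap_i depends on z_i and z_{i-1}.
        for i, g in enumerate(gaps):
            grad[i] += 2.0 * g
            if i > 0:
                grad[i - 1] -= 2.0 * g
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


if __name__ == "__main__":
    half_heights = [0.05, 0.05, 0.05]
    observed = [0.07, 0.13, 0.29]  # illustrative values: floating, penetrating, floating
    refined = refine_stack(observed, half_heights)
    print("gaps before:", contact_gaps(observed, half_heights))
    print("gaps after: ", contact_gaps(refined, half_heights))
```

Because the observation term keeps a small weight, the refined gaps shrink toward zero without vanishing exactly. The toy problem is convex, so it says nothing about the pose ambiguities a real cluttered scene can have.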

What carries the argument

The contact graph, which encodes spatial dependencies between objects and drives joint pose and property refinement inside differentiable rigid-body simulation to enforce physical consistency.
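
One way to picture such a contact graph is as a support tree plus a set of extra contact pairs. The sketch below is an illustration only, assuming that the edge set holds non-support contacts and that a hierarchical schedule would visit supporters before the objects resting on them; the class `ContactGraph`, its methods, and the object names are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ContactGraph:
    # support_parent[obj] is the object (or "ground") that obj rests on.
    support_parent: dict = field(default_factory=dict)
    # Undirected contact pairs that are not support relations (an assumption).
    contact_edges: set = field(default_factory=set)

    def add_support(self, child, parent="ground"):
        self.support_parent[child] = parent

    def add_contact(self, a, b):
        self.contact_edges.add(frozenset((a, b)))

    def depth(self, obj):
        """Number of support links between obj and the ground."""
        d = 0
        while obj != "ground":
            obj = self.support_parent[obj]
            d += 1
        return d

    def bottom_up_order(self):
        """Objects sorted so that every supporter precedes what it supports."""
        return sorted(self.support_parent, key=lambda o: (self.depth(o), o))


if __name__ == "__main__":
    g = ContactGraph()
    g.add_support("bowl")            # bowl rests on the ground
    g.add_support("box")             # box rests on the ground
    g.add_support("apple", "bowl")   # apple rests in the bowl
    g.add_contact("bowl", "box")     # side contact, not a support relation
    print(g.bottom_up_order())
```

Ordering objects by support depth is one simple way to refine a pile from the ground up, so that each object is adjusted against supporters whose poses have already been updated.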

If this is right

Reconstructed scenes achieve high physical fidelity.
Scenes faithfully replicate real-world contact dynamics.
The scenes enable stable and reliable contact-rich manipulation.
The pipeline works across both simulated and real-world evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same contact-graph approach could support online scene updates when a robot moves objects during manipulation.
Adding material and friction estimation to the optimization might further improve simulation-to-real transfer for learning-based policies.
The method connects perception directly to planning by producing scenes that can be used as starting states for physics-based planners.

Load-bearing premise

Single-view RGB-D observations contain enough information for contact-graph optimization to uniquely determine object poses and physical properties while preventing invalid states such as floating or inter-penetrating objects.

What would settle it

Reconstructed scenes that contain floating objects or inter-penetrations when loaded into a rigid-body simulator, or robot manipulation trials whose success rates differ sharply from real-world execution due to mismatched contact forces.
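
A rough version of the first check can be sketched without a physics engine. The code below assumes every object is approximated by an axis-aligned box and that the ground is the plane z = 0; a real test would load the reconstructed meshes into a rigid-body simulator such as PyBullet instead. The names `Box`, `penetration_depth`, `floating_gap`, and `invalid_states` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


def _overlap(a, b, axis):
    """Length of the overlap of two boxes along one axis (negative if apart)."""
    return min(a.hi[axis], b.hi[axis]) - max(a.lo[axis], b.lo[axis])


def penetration_depth(a, b):
    """Smallest per-axis overlap if the boxes intersect, else 0."""
    overlaps = [_overlap(a, b, k) for k in range(3)]
    return min(overlaps) if all(o > 0 for o in overlaps) else 0.0


def floating_gap(box, others):
    """Height of the box's bottom above the highest surface under its footprint."""
    support_z = 0.0  # the ground plane
    for o in others:
        under_footprint = _overlap(box, o, 0) > 0 and _overlap(box, o, 1) > 0
        if under_footprint and o.lo[2] < box.lo[2]:
            support_z = max(support_z, o.hi[2])
    return max(0.0, box.lo[2] - support_z)


def invalid_states(boxes, tol=1e-3):
    """List (kind, first, second, amount) for every penetration or floating box."""
    issues = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            depth = penetration_depth(a, b)
            if depth > tol:
                issues.append(("penetration", a.name, b.name, depth))
        gap = floating_gap(a, [b for b in boxes if b is not a])
        if gap > tol:
            issues.append(("floating", a.name, None, gap))
    return issues
```

An empty list from `invalid_states` only means the box approximation found nothing wrong; it does not show that the scene stays stable once gravity is simulated.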

Figures

Figures reproduced from arXiv: 2602.12633 by Andrew F. Luo, Guoyang Zhao, Jiahang Cao, Jun Ma, Sikai Guo, Tianyi Xiang.

**Figure 1.** Scene-level Real2Sim methods for physical stability. Given a single RGB-D observation and instance masks, we reconstruct the scene and simulate in PyBullet [7]. (a) SAM3D [8] with Iterative Closest Point (ICP) refinement, without geometric and physical constraints, results in interpenetration and floating, leading to unstable rollouts. (b) The geometry-only constrained method avoids penetration and ensures…

**Figure 2.** Overview of our method. Our physics-constrained Real2Sim pipeline consists of four stages. (a) Initial Reconstruction: Given a single RGB-D image I_t and instance masks M_t, we obtain an initial estimation of object geometry and appearance θ using SAM3D [8] and ICP pose refinement. (b) Contact Graph Construction: We construct a contact graph cg = (pt, E), where parse tree pt represents supporting tree and e…

**Figure 3.** Qualitative comparisons of physical simulation results with state-of-the-art scene-level reconstruction methods in the simulation environment. We visualize geometry and appearance before and after physical simulation with gravity in PyBullet [7]. Our method produces non-interpenetrating, contact-coherent geometry and achieves long-horizon physical stability compared with baseline methods.

Table I. Quantitat…

**Figure 4.** Real-world Real2Sim experiment with robot pushing interaction replay. We record the pushing trajectory of a Franka arm in the real world and replay it in the reconstructed digital twin. Using a single-view observation, our method produces a physically consistent scene and better matches the predicted post-interaction than SAM3D+ICP.

Table III. Quantitative results on physical stability, scene prediction err…

Abstract

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a contact graph to differentiable rigid-body simulation for physically consistent single-view reconstruction in clutter, but the abstract gives no numbers to show it actually resolves ambiguities.

The paper's main move is to build a Real-to-Sim pipeline that puts a contact graph inside differentiable rigid-body simulation. The idea is to jointly optimize object poses and physical properties from single-view RGB-D so the output scenes avoid floating objects and inter-penetration in cluttered settings. This targets a practical failure mode in robotic perception where pure geometry produces states that break downstream control and simulation transfer.

The contact-graph step is the clearest addition; it makes inter-object dependencies explicit during the optimization rather than treating objects independently. That fits naturally with existing differentiable simulation work and addresses a real pain point for contact-rich manipulation. The framing of the problem is straightforward and the motivation for physics constraints over geometric fidelity alone is clear.

The soft spot is that the abstract states the scenes achieve high physical fidelity and faithfully replicate contact dynamics, yet supplies no quantitative results, error bars, ablations, or concrete examples of how ambiguities are resolved. Single-view RGB-D in clutter often admits multiple contact-consistent configurations, and it is not obvious from the description that the chosen losses and simulation penalties are strong enough to rule out locally plausible but globally invalid states. The stress-test concern about residual pose ambiguities therefore looks worth checking in the full results. If the experiments show clear improvements in manipulation stability with proper baselines, the method gains traction; without that, the claims stay unverified.

This work is aimed at robotics researchers doing sim-to-real transfer for manipulation in complex scenes. Readers already using differentiable physics would get the most out of the contact modeling details. I would send it for peer review so the experiments and implementation can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper proposes a physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D observations in highly cluttered environments. It introduces a differentiable optimization framework that builds a contact graph to capture inter-object spatial dependencies and jointly refines object poses and physical properties via differentiable rigid-body simulation, with the goal of eliminating invalid states such as floating objects or inter-penetrations. The manuscript reports extensive evaluations in simulation and real-world settings claiming high physical fidelity and faithful replication of contact dynamics to support stable contact-rich robotic manipulation.

Significance. If the central claims are substantiated, the work would be significant for robotics and simulation-based planning, as it directly targets the common failure mode of physically invalid reconstructions that undermine downstream control in cluttered, contact-rich scenarios. The explicit use of inter-object contact reasoning and differentiable simulation represents a targeted advance over purely geometric perception pipelines.

major comments (2)

[Method (contact-graph optimization and differentiable simulation)] The central claim that the joint pose-and-property optimization produces scenes free of floating objects or inter-penetration (and thereby enables reliable contact-rich manipulation) is load-bearing. In the method section describing the contact-graph construction and differentiable simulation penalties, the manuscript must demonstrate—via ablation on loss terms, convergence analysis from varied initializations, or explicit metrics on invalid-state rates—that residual pose ambiguities from single-view RGB-D and partial occlusions are resolved rather than merely locally consistent.
[Evaluation and results] The abstract states that 'extensive evaluations in both simulation and real-world settings demonstrate' high physical fidelity and replication of contact dynamics. To support this, the results section (or associated tables/figures) must report quantitative metrics with error bars and baselines, such as mean penetration volume, floating height distributions, or downstream manipulation success rates under the reconstructed scenes versus geometric-only or non-contact-graph ablations.

minor comments (1)

[Method] Notation for the contact graph and the exact form of the differentiable simulation loss could be clarified with a small diagram or pseudocode to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Referee: [Method (contact-graph optimization and differentiable simulation)] The central claim that the joint pose-and-property optimization produces scenes free of floating objects or inter-penetration (and thereby enables reliable contact-rich manipulation) is load-bearing. In the method section describing the contact-graph construction and differentiable simulation penalties, the manuscript must demonstrate—via ablation on loss terms, convergence analysis from varied initializations, or explicit metrics on invalid-state rates—that residual pose ambiguities from single-view RGB-D and partial occlusions are resolved rather than merely locally consistent.

Authors: We agree that additional evidence is needed to substantiate that the optimization resolves pose ambiguities rather than achieving only local consistency. In the revised version, we have expanded the method section with an ablation study on the individual loss terms, including those from the contact graph and differentiable simulation. We also provide convergence analysis from varied initializations and report explicit metrics on the rates of invalid states (such as floating objects and inter-penetrations) pre- and post-optimization. These additions demonstrate the effectiveness of the joint optimization in handling ambiguities from single-view observations.
revision: yes

Referee: [Evaluation and results] The abstract states that 'extensive evaluations in both simulation and real-world settings demonstrate' high physical fidelity and replication of contact dynamics. To support this, the results section (or associated tables/figures) must report quantitative metrics with error bars and baselines, such as mean penetration volume, floating height distributions, or downstream manipulation success rates under the reconstructed scenes versus geometric-only or non-contact-graph ablations.

Authors: We appreciate this suggestion to strengthen the empirical support. We have revised the results section to include quantitative metrics with error bars, such as mean penetration volume and distributions of floating heights. Comparisons to baselines, including geometric-only reconstructions and ablations without the contact graph, are now presented. Furthermore, we report downstream manipulation success rates in the reconstructed scenes for both simulation and real-world settings to better support the claims of high physical fidelity and reliable contact-rich manipulation.
revision: yes

Circularity Check

0 steps flagged

No circularity: forward optimization pipeline remains independent of its outputs.

The paper describes a differentiable optimization pipeline that builds a contact graph from single-view RGB-D input and jointly refines poses and properties via rigid-body simulation. This constitutes a self-contained forward procedure whose validity is assessed by external simulation and real-world evaluations rather than by re-deriving the same quantities from fitted parameters or prior self-citations. No equations reduce the claimed physical consistency to a re-labeling of the input observations, and no uniqueness theorem or ansatz is imported from overlapping author work. The derivation therefore does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; no explicit fitting constants, background lemmas, or new postulated objects are described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "hierarchical physics-constrained optimization strategy based on differentiable rigid-body simulation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

[1] N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman, “Analysis and observations from the first amazon picking challenge,” IEEE Transactions on Automation Science and Engineering, vol. 15, no. 1, pp. 172–188, 2016

[2] M. Dogar and S. Srinivasa, “A framework for push-grasping in clutter,” Robotics: Science and systems VII, vol. 1, pp. 65–72, 2011

[3] R. Kartmann, F. Paus, M. Grotz, and T. Asfour, “Extraction of physically plausible support relations to predict and validate manipulation action effects,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3991–3998, 2018

[4] H. Xia, C.-H. Lin, H.-Y. Hsu, Q. Leboutet, K. Gao, M. Paulitsch, B. Ummenhofer, and S. Wang, “Holoscene: Simulation-ready interactive 3d worlds from a single video,” arXiv preprint arXiv:2510.05560, 2025

[5] M. T. Villasevil, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal, “Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation,” in Robotics: Science and Systems, 2024

[6] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simulation randomization with real world experience,” in 2019 International Conference on Robotics and Automation, 2019, pp. 8973–8979

[7] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016

[8] X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li et al., “Sam 3d: 3dfy anything in images,” arXiv preprint arXiv:2511.16624, 2025

[9] J. Ni, Y. Chen, B. Jing, N. Jiang, B. Wang, B. Dai, P. Li, Y. Zhu, S.-C. Zhu, and S. Huang, “Phyrecon: Physically plausible neural scene reconstruction,” Advances in Neural Information Processing Systems, vol. 37, pp. 25 747–25 780, 2024

[10] M. Guo, B. Wang, P. Ma, T. Zhang, C. Owens, C. Gan, J. Tenenbaum, K. He, and W. Matusik, “Physically compatible 3d object modeling from a single image,” Advances in Neural Information Processing Systems, vol. 37, pp. 119 260–119 282, 2024

[11] K. Yao, L. Zhang, X. Yan, Y. Zeng, Q. Zhang, L. Xu, W. Yang, J. Gu, and J. Yu, “Cast: Component-aligned 3d scene reconstruction from an rgb image,” ACM Transactions on Graphics, vol. 44, no. 4, pp. 1–19, 2025

[12] M. Malenický, M. Cífka, M. Fourmy, L. Montaut, J. Carpentier, J. Sivic, and V. Petrik, “Physpose: Refining 6d object poses with physical constraints,” arXiv preprint arXiv:2503.23587, 2025

[13] M. Han, Z. Zhang, Z. Jiao, X. Xie, Y. Zhu, S.-C. Zhu, and H. Liu, “Reconstructing interactive 3d scenes by panoptic mapping and cad model alignments,” in 2021 International Conference on Robotics and Automation, 2021, pp. 12 199–12 206

[14] C. Song and A. Boularias, “Inferring 3d shapes of unknown rigid objects in clutter through inverse physics reasoning,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 201–208, 2018

[15] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax–a differentiable physics engine for large scale rigid body simulation,” arXiv preprint arXiv:2106.13281, 2021

[16] M. Strecke and J. Stueckler, “Diffsdfsim: Differentiable rigid-body dynamics with implicit shapes,” in 2021 international conference on 3D Vision, 2021, pp. 96–105

[17] Y. Zhu, T. Xiang, A. M. Dollar, and Z. Pan, “One-shot real-to-sim via end-to-end differentiable simulation and rendering,” IEEE Robotics and Automation Letters, 2025

[18] T. Dai, J. Wong, Y. Jiang, C. Wang, C. Gokmen, R. Zhang, J. Wu, and L. Fei-Fei, “Acdc: Automated creation of digital cousins for robust policy learning,” arXiv e-prints, pp. arXiv–2410, 2024

[19] Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta, “Urdformer: A pipeline for constructing articulated simulation environments from real-world images,” arXiv preprint arXiv:2405.11656, 2024

[20] J. Abou-Chakra, K. Rana, F. Dayoub, and N. Suenderhauf, “Physically embodied gaussian splatting: A visually learnt and physically grounded 3d representation for robotics,” in 8th Annual Conference on Robot Learning, 2024

[21] J. Abou-Chakra, L. Sun, K. Rana, B. May, K. Schmeckpeper, M. V. Minniti, and L. Herlant, “Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin for real-world robot policy evaluation,” arXiv preprint arXiv:2504.03597, 2025

[22] H. Jiang, H.-Y. Hsu, K. Zhang, H.-N. Yu, S. Wang, and Y. Li, “Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos,” arXiv preprint arXiv:2503.17973, 2025

[23] R. Huang, B. Yang, W. Gui, J. Morgan, E. Biyik, and J. Li, “Synctwin: Fast digital twin construction and synchronization for safe robotic grasping,” arXiv preprint arXiv:2601.09920, 2026

[24] S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,” arXiv preprint arXiv:2406.02523, 2024

[25] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3D: A modern library for 3D data processing,” arXiv:1801.09847, 2018

[26] R. B. McGhee and A. A. Frank, “On the stability properties of quadruped creeping gaits,” Mathematical Biosciences, vol. 3, pp. 331–351, 1968

[27] K. M. Jatavallabhula, E. Smith, J.-F. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A pytorch library for accelerating 3D deep learning research,” arXiv preprint arXiv:1911.05063, 2019

[28] M. Macklin, K. Erleben, M. Müller, N. Chentanez, S. Jeschke, and Z. Corse, “Local optimization for robust signed distance field collision,” Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 3, no. 1, pp. 1–17, 2020

[29] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler, “Learning to predict 3d objects with an interpolation-based differentiable renderer,” Advances in neural information processing systems, vol. 32, 2019

[30] L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V. Vanhoucke, “Google scanned objects: A high-quality dataset of 3d scanned household items,” in 2022 International Conference on Robotics and Automation, 2022, pp. 2553–2560

[31] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set,” IEEE Robotics & Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015

[32] S. Stojanov, A. Thai, and J. M. Rehg, “Using shape to categorize: Low-shot learning with an explicit shape bias,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1798–1808

[1] [1]

Anal- ysis and observations from the first amazon picking challenge,

N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman, “Anal- ysis and observations from the first amazon picking challenge,”IEEE Transactions on Automation Science and Engineering, vol. 15, no. 1, pp. 172–188, 2016

work page 2016

[2] [2]

A framework for push-grasping in clutter,

M. Dogar and S. Srinivasa, “A framework for push-grasping in clutter,” Robotics: Science and systems VII, vol. 1, pp. 65–72, 2011

work page 2011

[3] [3]

Extraction of physically plausible support relations to predict and validate manipulation action effects,

R. Kartmann, F. Paus, M. Grotz, and T. Asfour, “Extraction of physically plausible support relations to predict and validate manipulation action effects,”IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3991– 3998, 2018

work page 2018

[4] [4]

Holoscene: Simulation-ready interactive 3d worlds from a single video,

H. Xia, C.-H. Lin, H.-Y . Hsu, Q. Leboutet, K. Gao, M. Paulitsch, B. Ummenhofer, and S. Wang, “Holoscene: Simulation-ready interactive 3d worlds from a single video,”arXiv preprint arXiv:2510.05560, 2025

work page arXiv 2025

[5] [5]

Reconciling reality through simulation: A real-to-sim- to-real approach for robust manipulation,

M. T. Villasevil, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal, “Reconciling reality through simulation: A real-to-sim- to-real approach for robust manipulation,” inRobotics: Science and Systems, 2024

work page 2024

[6] [6]

Closing the sim-to-real loop: Adapting simulation random- ization with real world experience,

Y . Chebotar, A. Handa, V . Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simulation random- ization with real world experience,” in2019 International Conference on Robotics and Automation, 2019, pp. 8973–8979

work page 2019

[7] [7]

Pybullet, a python module for physics simulation for games, robotics and machine learning,

E. Coumans and Y . Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016

work page 2016

[8] [8]

SAM 3D: 3Dfy Anything in Images

X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Liet al., “Sam 3d: 3dfy anything in images,” arXiv preprint arXiv:2511.16624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Phyrecon: Physically plausible neural scene reconstruction,

J. Ni, Y . Chen, B. Jing, N. Jiang, B. Wang, B. Dai, P. Li, Y . Zhu, S.- C. Zhu, and S. Huang, “Phyrecon: Physically plausible neural scene reconstruction,”Advances in Neural Information Processing Systems, vol. 37, pp. 25 747–25 780, 2024

work page 2024

[10] [10]

Physically compatible 3d object modeling from a single image,

M. Guo, B. Wang, P. Ma, T. Zhang, C. Owens, C. Gan, J. Tenenbaum, K. He, and W. Matusik, “Physically compatible 3d object modeling from a single image,”Advances in Neural Information Processing Systems, vol. 37, pp. 119 260–119 282, 2024

work page 2024

[11] [11]

Cast: Component-aligned 3d scene reconstruction from an rgb image,

K. Yao, L. Zhang, X. Yan, Y . Zeng, Q. Zhang, L. Xu, W. Yang, J. Gu, and J. Yu, “Cast: Component-aligned 3d scene reconstruction from an rgb image,”ACM Transactions on Graphics, vol. 44, no. 4, pp. 1–19, 2025

work page 2025

[12] [12]

Physpose: Refining 6d object poses with physical constraints,

M. Malenick `y, M. C´ıfka, M. Fourmy, L. Montaut, J. Carpentier, J. Sivic, and V . Petrik, “Physpose: Refining 6d object poses with physical constraints,”arXiv preprint arXiv:2503.23587, 2025

work page arXiv 2025

[13] [13]

Reconstructing interactive 3d scenes by panoptic mapping and cad model alignments,

M. Han, Z. Zhang, Z. Jiao, X. Xie, Y . Zhu, S.-C. Zhu, and H. Liu, “Reconstructing interactive 3d scenes by panoptic mapping and cad model alignments,” in2021 International Conference on Robotics and Automation, 2021, pp. 12 199–12 206

work page 2021

[14] [14]

Inferring 3d shapes of unknown rigid objects in clutter through inverse physics reasoning,

C. Song and A. Boularias, “Inferring 3d shapes of unknown rigid objects in clutter through inverse physics reasoning,”IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 201–208, 2018

work page 2018

[15] [15]

Brax–a differentiable physics engine for large scale rigid body simulation,

C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax–a differentiable physics engine for large scale rigid body simulation,”arXiv preprint arXiv:2106.13281, 2021

work page arXiv 2021

[16] [16]

Diffsdfsim: Differentiable rigid-body dynamics with implicit shapes,

M. Strecke and J. Stueckler, “Diffsdfsim: Differentiable rigid-body dynamics with implicit shapes,” in2021 international conference on 3D Vision, 2021, pp. 96–105

work page 2021

[17] [17]

One-shot real-to-sim via end-to-end differentiable simulation and rendering,

Y . Zhu, T. Xiang, A. M. Dollar, and Z. Pan, “One-shot real-to-sim via end-to-end differentiable simulation and rendering,”IEEE Robotics and Automation Letters, 2025

[18]

T. Dai, J. Wong, Y. Jiang, C. Wang, C. Gokmen, R. Zhang, J. Wu, and L. Fei-Fei, “Acdc: Automated creation of digital cousins for robust policy learning,” arXiv e-prints, pp. arXiv–2410, 2024.

[19]

Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta, “Urdformer: A pipeline for constructing articulated simulation environments from real-world images,” arXiv preprint arXiv:2405.11656, 2024.

[20]

J. Abou-Chakra, K. Rana, F. Dayoub, and N. Suenderhauf, “Physically embodied gaussian splatting: A visually learnt and physically grounded 3D representation for robotics,” in 8th Annual Conference on Robot Learning, 2024.

[21]

J. Abou-Chakra, L. Sun, K. Rana, B. May, K. Schmeckpeper, M. V. Minniti, and L. Herlant, “Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin for real-world robot policy evaluation,” arXiv preprint arXiv:2504.03597, 2025.

[22]

H. Jiang, H.-Y. Hsu, K. Zhang, H.-N. Yu, S. Wang, and Y. Li, “Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos,” arXiv preprint arXiv:2503.17973, 2025.

[23]

R. Huang, B. Yang, W. Gui, J. Morgan, E. Biyik, and J. Li, “Synctwin: Fast digital twin construction and synchronization for safe robotic grasping,” arXiv preprint arXiv:2601.09920, 2026.

[24]

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,” arXiv preprint arXiv:2406.02523, 2024.

[25]

Q.-Y. Zhou, J. Park, and V. Koltun, “Open3D: A modern library for 3D data processing,” arXiv:1801.09847, 2018.

[26]

R. B. McGhee and A. A. Frank, “On the stability properties of quadruped creeping gaits,” Mathematical Biosciences, vol. 3, pp. 331–351, 1968.

[27]

K. M. Jatavallabhula, E. Smith, J.-F. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A PyTorch library for accelerating 3D deep learning research,” arXiv preprint arXiv:1911.05063, 2019.

[28]

M. Macklin, K. Erleben, M. Müller, N. Chentanez, S. Jeschke, and Z. Corse, “Local optimization for robust signed distance field collision,” Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 3, no. 1, pp. 1–17, 2020.

[29]

W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler, “Learning to predict 3D objects with an interpolation-based differentiable renderer,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[30]

L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V. Vanhoucke, “Google scanned objects: A high-quality dataset of 3D scanned household items,” in 2022 International Conference on Robotics and Automation, 2022, pp. 2553–2560.

[31]

B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set,” IEEE Robotics & Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015.

[32]

S. Stojanov, A. Thai, and J. M. Rehg, “Using shape to categorize: Low-shot learning with an explicit shape bias,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1798–1808.
