pith. machine review for the scientific record.

arxiv: 2604.27106 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Recognition: unknown

Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:24 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords 3D scene reconstruction · multi-object scenes · generative models · RGB-D images · occlusion handling · pose estimation · synthetic priors · shape reconstruction

The pith

RecGen jointly estimates shapes, parts, and poses for multi-object 3D scenes from sparse RGB-D views by training generative models on compositional synthetic scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RecGen, a generative approach to full 3D scene reconstruction that jointly infers object shapes, part shapes, and object poses even when views are limited and objects block one another. It trains on synthetically assembled scenes to build shape knowledge that carries over to real photographs and varied environments. A reader would care because this reduces the need for enormous real-world 3D datasets while still delivering usable geometry and positioning for downstream tasks such as robotics simulation. The reported results show consistent gains on challenging occluded test cases over a prior method that used far more training meshes.

Core claim

RecGen is a generative framework for probabilistic joint estimation of object and part shapes, as well as their pose under occlusion and partial visibility from one or multiple RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. It achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture.

What carries the argument

Generative model trained on compositionally assembled synthetic scenes to produce transferable 3D shape and pose priors for joint probabilistic inference from sparse RGB-D input.
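
The paper's concrete pipeline is not reproduced in this summary, so the following is only a minimal sketch of what "compositionally assembled synthetic scenes" could mean in practice: sample object meshes from an asset library, drop them into a scene with random poses, and reject placements whose bounding boxes collide. The library contents, the planar-pose sampler, and the axis-aligned collision test are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch of compositional synthetic scene assembly (assumed, not the paper's pipeline).
    import numpy as np

    def random_yaw_pose(xy_range=1.0):
        """Sample a tabletop pose: a yaw rotation plus an (x, y) translation."""
        yaw = np.random.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(yaw), np.sin(yaw)
        pose = np.eye(4)
        pose[:2, :2] = [[c, -s], [s, c]]
        pose[:2, 3] = np.random.uniform(-xy_range, xy_range, size=2)
        return pose

    def aabb_overlaps(center_a, half_a, center_b, half_b):
        """Axis-aligned bounding-box overlap test: a crude stand-in for real collision checking."""
        return bool(np.all(np.abs(center_a - center_b) < (half_a + half_b)))

    def compose_scene(asset_half_extents, n_objects=5, max_tries=100):
        """Place up to n_objects assets with non-overlapping footprints; return (asset_id, pose) pairs."""
        placed = []  # (asset_id, pose, half_extent)
        ids = list(asset_half_extents)
        for _ in range(max_tries):
            if len(placed) == n_objects:
                break
            asset_id = ids[np.random.randint(len(ids))]
            pose = random_yaw_pose()
            half = asset_half_extents[asset_id]
            if all(not aabb_overlaps(pose[:3, 3], half, p[:3, 3], h) for _, p, h in placed):
                placed.append((asset_id, pose, half))
        return [(a, p) for a, p, _ in placed]

    # Toy asset library: object id -> half-extent of its bounding box (metres).
    library = {"mug": np.array([0.05, 0.05, 0.06]),
               "box": np.array([0.10, 0.08, 0.05]),
               "bowl": np.array([0.08, 0.08, 0.04])}
    print(compose_scene(library))

Each composed scene would then be rendered to RGB-D with ground-truth shapes, parts, and poses, which is what lets a generative model learn occlusion-robust priors without real 3D annotations.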

If this is right

  • The method produces usable estimates for object parts and symmetric items that prior techniques handled poorly under occlusion.
  • It reaches higher geometric accuracy, texture fidelity, and pose precision than SAM3D while requiring roughly 80 percent fewer training meshes.
  • Performance holds across single-view and multi-view inputs on heavily occluded real-world test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The data-efficiency result points to structured synthetic composition as a practical route for lowering the cost of building 3D perception systems for new environments.
  • Similar generative priors could be tested for extending reconstruction to dynamic or video sequences where temporal information further constrains the possible shapes and motions.
  • Robotics applications that need rapid scene models for planning would gain from the reported robustness to partial views and clutter.

Load-bearing premise

Shape priors acquired from synthetic scenes composed of known objects will transfer to real photographs that contain different lighting, textures, and object instances without a large performance penalty.

What would settle it

A clear performance collapse relative to baselines when the same model is evaluated on a fresh set of real multi-object scenes whose object categories or surface appearances were never used in the synthetic training compositions.
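
A concrete way to run that test, sketched under the assumption that every real evaluation scene is tagged with the object categories it contains (the scene records and category names below are hypothetical placeholders):

    # Sketch of the held-out evaluation proposed above: keep only real scenes whose
    # object categories never appeared in the synthetic training compositions.
    # Scene records and category names are hypothetical.
    synthetic_training_categories = {"mug", "box", "bowl", "chair"}

    real_scenes = [
        {"id": "scene_001", "categories": {"mug", "plant"}},
        {"id": "scene_002", "categories": {"shoe", "backpack"}},
        {"id": "scene_003", "categories": {"box", "bowl"}},
    ]

    # A scene counts as "fresh" only if none of its categories were used in training.
    fresh_scenes = [s for s in real_scenes
                    if s["categories"].isdisjoint(synthetic_training_categories)]

    print("held-out evaluation set:", [s["id"] for s in fresh_scenes])

Comparing RecGen against the baselines on this fresh subset alone, rather than on the full test set, is what would separate genuine transfer from memorized compositions.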

read the original abstract

Accurately reconstructing complex full multi-object scenes from sparse observations remains a core challenge in computer vision and a key step toward scalable and reliable simulation for robotics. In this work, we introduce RecGen, a generative framework for probabilistic joint estimation of object and part shapes, as well as their pose under occlusion and partial visibility from one or multiple RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. RecGen achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture. Despite using nearly 80% fewer training meshes than the previous state of the art SAM3D, RecGen outperforms it by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation.
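
For readers checking the headline numbers, the relative figures in the abstract are ratios of the usual form; the raw values below are hypothetical and serve only to show how a 30.1% improvement or an 80% mesh reduction would be computed.

    # How the abstract's relative figures are typically computed.
    # The raw values here are hypothetical; only the formulas matter.

    def relative_improvement(ours, baseline, higher_is_better=True):
        """Gain of `ours` over `baseline`, as a percentage of the baseline."""
        gain = (ours - baseline) if higher_is_better else (baseline - ours)
        return 100.0 * gain / baseline

    def relative_reduction(ours, baseline):
        """Percentage reduction of a resource (e.g. training-mesh count) versus the baseline."""
        return 100.0 * (baseline - ours) / baseline

    # A mesh budget one fifth the size of the baseline's is an 80% reduction:
    print(relative_reduction(ours=200_000, baseline=1_000_000))            # 80.0
    # A shape-quality score of 0.651 against a baseline of 0.500 is a 30.2% gain:
    print(round(relative_improvement(ours=0.651, baseline=0.500), 1))      # 30.2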

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RecGen, a generative framework for probabilistic joint estimation of object and part shapes as well as their poses from sparse RGB-D observations in multi-object scenes. It relies on compositional synthetic scene generation to learn strong 3D shape priors that are claimed to generalize to diverse real-world environments, handling severe occlusions, symmetry, and intricate geometry/texture. The central claim is state-of-the-art performance on complex, heavily occluded datasets, outperforming SAM3D by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation while using nearly 80% fewer training meshes.

Significance. If the generalization from synthetic compositional priors to real occluded scenes holds, the work would be significant for scalable robotics simulation by showing that generative 3D priors can deliver substantial gains with far less training data than prior methods. The reconstruction-by-generation paradigm for joint shape-pose inference under partial visibility is a promising direction, and the efficiency claim (80% fewer meshes) would be a notable contribution if supported by rigorous cross-domain validation.

major comments (2)
  1. [§5] §5 (Experiments): The headline performance gains on real-world heavily occluded datasets are presented without quantitative evidence that the synthetic training distribution closes the domain gap for real textures, lighting, and sensor noise. No real-vs-synthetic performance tables, domain-randomization ablations, or texture distribution statistics are reported, so it is unclear whether the 30.1% shape-quality improvement follows from the method or from unverified transfer assumptions.
  2. [§4] §4 (Method) and §5.1 (Ablations): The claim that strong shape priors learned from compositional synthetic scenes suffice for real-world generalization is load-bearing for the data-efficiency argument, yet the manuscript provides no controlled experiments isolating the contribution of the generative prior versus potential differences in baseline re-implementations or metric definitions.
minor comments (2)
  1. The abstract and introduction should explicitly list the exact real-world datasets used for testing and the precise training mesh count for both RecGen and SAM3D to allow direct verification of the 80% reduction claim.
  2. Figure captions and table footnotes could more clearly indicate whether reported metrics are computed on held-out synthetic scenes or on the real-world test sets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the reconstruction-by-generation paradigm. We address the major comments point by point below and outline planned revisions to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [§5] §5 (Experiments): The headline performance gains on real-world heavily occluded datasets are presented without quantitative evidence that the synthetic training distribution closes the domain gap for real textures, lighting, and sensor noise. No real-vs-synthetic performance tables, domain-randomization ablations, or texture distribution statistics are reported, so it is unclear whether the 30.1% shape-quality improvement follows from the method or from unverified transfer assumptions.

    Authors: We agree that explicit quantification of the domain gap would strengthen the presentation. The current results rely on direct evaluation on real datasets as implicit evidence of generalization from the compositional synthetic priors. In the revised manuscript we will add (i) a table comparing reconstruction metrics on held-out synthetic test scenes versus the real evaluation sets, (ii) domain-randomization ablations that vary texture, lighting, and noise parameters during training, and (iii) basic texture-distribution statistics between the synthetic corpus and the real test images. These additions will make the source of the reported gains more transparent. revision: yes

  2. Referee: [§4] §4 (Method) and §5.1 (Ablations): The claim that strong shape priors learned from compositional synthetic scenes suffice for real-world generalization is load-bearing for the data-efficiency argument, yet the manuscript provides no controlled experiments isolating the contribution of the generative prior versus potential differences in baseline re-implementations or metric definitions.

    Authors: We acknowledge the need for tighter isolation of the generative prior's contribution. Section 5.1 already contains ablations that disable the compositional generation and shape-prior components, showing measurable drops in performance. To address concerns about re-implementation details, the revision will (i) expand the description of our SAM3D re-implementation (including exact mesh counts, training schedules, and metric computation code), (ii) add a controlled experiment that trains RecGen without the generative prior while keeping all other architecture and optimization choices identical, and (iii) include a short appendix clarifying metric definitions. These changes will better separate the effect of the prior from other factors. revision: yes
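
Taken together, the two responses amount to a small experiment grid: the domain-randomization ranges promised in response 1 crossed with the with/without-prior toggle promised in response 2. A minimal sketch follows; every field name and range is an illustrative assumption, not a value from the paper or the rebuttal.

    # Sketch of the experiment grid implied by the two responses above:
    # domain-randomization ablations (response 1) crossed with a controlled
    # generative-prior toggle (response 2). Names and ranges are assumed.
    from dataclasses import dataclass
    from itertools import product

    @dataclass(frozen=True)
    class AblationConfig:
        use_generative_prior: bool   # response 2: only this differs within a matched pair
        texture_jitter: float        # response 1: texture randomization strength
        lighting_jitter: float       # response 1: lighting randomization strength
        depth_noise_std: float       # response 1: simulated sensor noise (metres)

    grid = [
        AblationConfig(prior, tex, light, noise)
        for prior, tex, light, noise in product(
            (True, False), (0.0, 0.5), (0.0, 0.5), (0.0, 0.01)
        )
    ]

    for i, cfg in enumerate(grid):
        # train_and_evaluate(cfg) would be the (hypothetical) entry point for each run;
        # comparing matched pairs that differ only in use_generative_prior isolates the prior.
        print(f"run {i:02d}: {cfg}")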

Circularity Check

0 steps flagged

No circularity detected in derivation or performance claims

full rationale

The manuscript introduces RecGen as a generative model leveraging compositional synthetic scene generation and 3D shape priors to achieve reported gains over the external baseline SAM3D. No equations, self-definitional relations, fitted-input predictions, or load-bearing self-citations are present that reduce the claimed shape/texture/pose metrics or generalization statements to quantities defined by construction within the paper itself. Performance numbers are framed as direct empirical comparisons against an independent prior method on held-out data, rendering the central claims self-contained rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that synthetic compositional scenes plus learned shape priors suffice to bridge the domain gap to real data; this is a domain assumption rather than a derived result. No explicit free parameters or invented entities are named in the abstract, but the generative model implicitly contains many tunable components typical of modern neural architectures.

free parameters (1)
  • shape prior strength
    The weighting or regularization strength of the 3D shape priors is almost certainly tuned during training to achieve the reported generalization; a generic sketch of how such a weight enters the objective follows this ledger.
axioms (1)
  • domain assumption: Compositional synthetic scene generation produces training distributions sufficiently close to real-world multi-object scenes for the learned priors to transfer.
    Invoked to justify training on synthetic data while claiming real-world generalization.
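
The "shape prior strength" flagged above is most naturally read as a weighting term in the training objective. The decomposition below and the values of lambda_prior are assumptions about how such priors are usually weighted, not the paper's loss.

    # Generic sketch of how a shape-prior weight enters a reconstruction objective.
    # The decomposition and the lambda_prior values are illustrative, not from the paper.

    def total_loss(recon_loss, pose_loss, prior_loss, lambda_prior=0.1):
        """Weighted sum; lambda_prior is the free parameter flagged in the ledger."""
        return recon_loss + pose_loss + lambda_prior * prior_loss

    # Sweeping lambda_prior is exactly the tuning the ledger points at:
    for lam in (0.0, 0.01, 0.1, 1.0):
        print(lam, total_loss(recon_loss=0.8, pose_loss=0.3, prior_loss=0.5, lambda_prior=lam))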

pith-pipeline@v0.9.0 · 5510 in / 1456 out tokens · 53482 ms · 2026-05-07T08:24:04.364715+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 21 canonical work pages · 6 internal anchors

  1. [1]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Li, C, Zhang, R, Wong, J, Gokmen, C, Srivastava, S, Martín-Martín, R, Wang, C, Levine, G, Lingelbach, M, Sun, J, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. CoRL. (2023)

  2. [2]

    Habitat: A platform for embodied ai research

    Savva, M, Kadian, A, Maksymets, O, Zhao, Y, Wijmans, E, Jain, B, Straub, J, Liu, J, Koltun, V, Malik, J, et al. Habitat: A platform for embodied ai research. ICCV. (2019)

  3. [3]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mittal, M, Roth, P, Tigue, J, Richard, A, Zhang, O, Du, P, Serrano-Munoz, A, Yao, X, Zurbrügg, R, Rudin, N, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv:2511.04831 (2025)

  4. [4]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T, Chen, Z, Chen, B, Cai, Z, Liu, Y, Li, Z, Liang, Q, Lin, X, Ge, Y, Gu, Z, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv:2506.18088 (2025)

  5. [5]

    Advancements and challenges of digital twins in industry

    Tao, F, Zhang, H, and Zhang, C. Advancements and challenges of digital twins in industry. Nature Computational Science (2024)

  6. [6]

    Living scenes: Multi-object relocalization and reconstruction in changing 3d environments

    Zhu, L, Huang, S, Schindler, K, and Armeni, I. Living scenes: Multi-object relocalization and reconstruction in changing 3d environments. CVPR. (2024)

  7. [7]

    SAM 3D: 3Dfy Anything in Images

    Chen, X, Chu, FJ, Gleize, P, Liang, KJ, Sax, A, Tang, H, Wang, W, Guo, M, Hardin, T, Li, X, et al. Sam 3d: 3dfy anything in images. arXiv:2511.16624 (2025)

  8. [8]

    Diffusionnocs: Managing symmetry and uncertainty in sim2real multi-modal category-level pose estimation

    Ikeda, T, Zakharov, S, Ko, T, Irshad, MZ, Lee, R, Liu, K, Ambrus, R, and Nishiwaki, K. Diffusionnocs: Managing symmetry and uncertainty in sim2real multi-modal category-level pose estimation. IROS. (2024)

  9. [9]

    Zero-1-to-3: Zero-shot one image to 3d object

    Liu, R, Wu, R, Van Hoorick, B, Tokmakov, P, Zakharov, S, and Vondrick, C. Zero-1-to-3: Zero-shot one image to 3d object. ICCV. (2023)

  10. [10]

    Structured 3d latents for scalable and versatile 3d generation

    Xiang, J, Lv, Z, Xu, S, Deng, Y, Wang, R, Zhang, B, Chen, D, Tong, X, and Yang, J. Structured 3d latents for scalable and versatile 3d generation. CVPR. (2025)

  11. [11]

    Team, TH. Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation. (2024)

  12. [12]

    Any6D: Model-free 6D Pose Estimation of Novel Objects

    Lee, T, Wen, B, Kang, M, Kang, G, Kweon, IS, and Yoon, KJ. Any6D: Model-free 6D Pose Estimation of Novel Objects. CVPR. (2025)

  13. [13]

    Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images

    Liu, Y, Wen, Y, Peng, S, Lin, C, Long, X, Komura, T, and Wang, W. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. ECCV. (2022)

  14. [14]

    SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation

    Agarwal, A, Singh, G, Sen, B, Lozano-Pérez, T, and Kaelbling, LP. SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation. arXiv:2410.23643 (2024)

  15. [15]

    Foundationpose: Unified 6d pose estimation and tracking of novel objects

    Wen, B, Yang, W, Kautz, J, and Birchfield, S. Foundationpose: Unified 6d pose estimation and tracking of novel objects. CVPR. (2024)

  16. [16]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Xu, J, Cheng, W, Gao, Y, Wang, X, Gao, S, and Shan, Y. InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. arXiv:2404.07191 (2024)

  17. [17]

    Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects

    Yu, Q, Yuan, X, Jiang, Y, Chen, J, Zheng, D, Hao, C, You, Y, Chen, Y, Mu, Y, Liu, L, et al. Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects. IROS. (2025)

  18. [18]

    DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation

    Jiang, T, Guan, Y, Ma, L, Xu, J, Meng, J, Chen, W, Zeng, Z, Li, L, Wu, D, and Chen, R. DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation. IEEE Transactions on Robotics (2025)

  19. [19]

    Foundationstereo: Zero-shot stereo matching

    Wen, B, Trepte, M, Aribido, J, Kautz, J, Gallo, O, and Birchfield, S. Foundationstereo: Zero-shot stereo matching. CVPR. (2025)

  20. [20]

    Wang, Z, Wang, Y, Chen, Y, Xiang, C, Chen, S, Yu, D, Li, C, Su, H, and Zhu, J. CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model. (2024)

  21. [21]

    Tang, J, Chen, Z, Chen, X, Wang, T, Zeng, G, and Liu, Z. LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. (2024)

  22. [22]

    Team, TH. Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. (2025)

  23. [23]

    Gigapose: Fast and robust novel object pose estimation via one correspondence

    Nguyen, VN, Groueix, T, Salzmann, M, and Lepetit, V. Gigapose: Fast and robust novel object pose estimation via one correspondence. CVPR. (2024)

  24. [24]

    Pos3R: 6D Pose Estimation for Unseen Objects Made Easy

    Deng, W, Campbell, D, Sun, C, Zhang, J, Kanitkar, S, Shaffer, ME, and Gould, S. Pos3R: 6D Pose Estimation for Unseen Objects Made Easy. CVPR. (2025)

  25. [25]

    Liu, K, Zakharov, S, Chen, D, Ikeda, T, Shakhnarovich, G, Gaidon, A, and Ambrus, R. OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World. (2025)

  26. [26]

    Structure-from-motion revisited

    Schonberger, JL and Frahm, JM. Structure-from-motion revisited. CVPR. (2016)

  27. [27]

    Ardelean, A, Özer, M, and Egger, B. Gen3DSR: Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View. (2025)

  28. [28]

    CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation

    Irshad, MZ, Kollar, T, Laskey, M, Stone, K, and Kira, Z. CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation. ICRA. (2022)

  29. [29]

    ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization

    Irshad, MZ, Zakharov, S, Ambrus, R, Kollar, T, Kira, Z, and Gaidon, A. ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization. ECCV. (2022)

  30. [30]

    MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

    Li, Y, Zhang, J, Chen, Z, Wang, Z, and Liu, Z. MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation. CVPR. (2025)

  31. [31]

    PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

    Chen, M, Shapovalov, R, Laina, I, Monnier, T, Wang, J, Novotny, D, and Vedaldi, A. PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models. CVPR. (2025)

  32. [32]

    UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents

    He, X, Wu, Y, Guo, X, Ye, C, Zhou, J, Hu, T, Han, X, and Du, D. UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents. arXiv:2512.09435 (2026)

  33. [33]

    BANG: Dividing 3D Assets via Generative Exploded Dynamics

    Zhang, L, Zhang, Q, Jiang, H, Bai, Y, Yang, W, Xu, L, and Yu, J. BANG: Dividing 3D Assets via Generative Exploded Dynamics. ACM TOG (2025)

  34. [34]

    Lin, Y, Lin, C, Pan, P, Yan, H, Feng, Y, Mu, Y, and Fragkiadaki, K. PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers. (2025)

  35. [35]

    Melnik, A, Alt, B, Nguyen, G, Wilkowski, A, Stefańczyk, M, Wu, Q, Harms, S, Rhodin, H, Savva, M, and Beetz, M. Digital Twin Generation from Visual Data: A Survey. (2026)

  36. [36]

    Irshad, MZ, Comi, M, Lin, YC, Heppert, N, Valada, A, Ambrus, R, Kira, Z, and Tremblay, J. Neural Fields in Robotics: A Survey. (2024)

  37. [37]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering

    Kerbl, B, Kopanas, G, Leimkühler, T, and Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM TOG (2023)

  38. [38]

    Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects

    Yu, J, Hari, K, El-Refai, K, Dalil, A, Kerr, J, Kim, CM, Cheng, R, Irshad, MZ, and Goldberg, K. Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects. ICRA (2025)

  39. [39]

    Qureshi, MN, Garg, S, Yandun, F, Held, D, Kantor, G, and Silwal, A. SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting. (2024)

  40. [40]

    Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

    Shorinwa, O, Tucker, J, Smith, A, Swann, A, Chen, T, Firoozi, R, Kennedy, MD, and Schwager, M. Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting. (2024)

  41. [41]

    Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics

    Abou-Chakra, J, Rana, K, Dayoub, F, and Sünderhauf, N. Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics. (2024)

  42. [42]

    GraspSplats: Efficient Manipulation with 3D Feature Splatting

    Ji, M, Qiu, RZ, Zou, X, and Wang, X. GraspSplats: Efficient Manipulation with 3D Feature Splatting. arXiv:2409.02084 (2024)

  43. [43]

    Chhablani, G, Ye, X, Irshad, MZ, and Kira, Z. EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device. (2025)

  44. [44]

    Escontrela, A, Kerr, J, Allshire, A, Frey, J, Duan, R, Sferrazza, C, and Abbeel, P. GaussGym: An open-source real-to-sim framework for learning locomotion from pixels. (2025)

  45. [45]

    Distilled feature fields enable few-shot language-guided manipulation

    Shen, W, Yang, G, Yu, A, Wong, J, Kaelbling, LP, and Isola, P. Distilled feature fields enable few-shot language-guided manipulation. arXiv:2308.07931 (2023)

  46. [46]

    Yang, S, Yu, W, Zeng, J, Lv, J, Ren, K, Lu, C, Lin, D, and Pang, J. Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation. (2025)

  47. [47]

    GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    Jiang, G, Chang, H, Qiu, RZ, Liang, Y, Ji, M, Zhu, J, Dong, Z, Zou, X, and Wang, X. GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation. arXiv:2510.20813 (2025)

  48. [48]

    X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

    Dan, P, Kedia, K, Chao, A, Duan, E, Pace, MA, Ma, WC, and Choudhury, S. X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real. CoRL. (2025)

  49. [49]

    Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination

    Barcellona, L, Zadaianchuk, A, Allegro, D, Papa, S, Ghidoni, S, and Gavves, E. Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. ICLR. (2025)

  50. [50]

    Yu, J, Fu, L, Huang, H, El-Refai, K, Ambrus, RA, Cheng, R, Irshad, MZ, and Goldberg, K. Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware. (2025)

  51. [51]

    ZeroBot: Learning From Scratch in Minutes With Generative Real2Sim

    Kapelyukh, I, Zhang, X, James, S, Herlant, L, and Johns, E. ZeroBot: Learning From Scratch in Minutes With Generative Real2Sim. RA-L (2026)

  52. [52]

    Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions

    Zhang, K, Sha, S, Jiang, H, Loper, M, Song, H, Cai, G, Xu, Z, Hu, X, Zheng, C, and Li, Y. Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions. ICRA. (2026)

  53. [53]

    Jangir, Y, Zhang, Y, Lo, PC, Yamazaki, K, Zhang, C, Tu, KH, Ke, TW, Ke, L, Bisk, Y, and Fragkiadaki, K. RobotArena∞: Scalable Robot Benchmarking via Real-to-Sim Translation. (2025)

  54. [54]

    Jain, A, Zhang, M, Arora, K, Chen, W, Torne, M, Irshad, MZ, Zakharov, S, Wang, Y, Levine, S, Finn, C, Ma, WC, Shah, D, Gupta, A, and Pertsch, K. PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies. (2025)

  55. [55]

    Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

    Yu, X, Talak, R, Shaikewitz, L, and Carlone, L. Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling. arXiv:2602.08058 (2026)

  56. [56]

    Xiang, T, Cao, J, Guo, S, Zhao, G, Luo, AF, and Ma, J. Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning. (2026)

  57. [57]

    Huang, WC, Han, J, Ye, X, Pan, Z, and Hauser, K. Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization. (2026)

  58. [58]

    Flow Matching for Generative Modeling

    Lipman, Y, Chen, RT, Ben-Hamu, H, Nickel, M, and Le, M. Flow Matching for Generative Modeling. ICLR. (2023)

  59. [59]

    Scalable diffusion models with transformers

    Peebles, W and Xie, S. Scalable diffusion models with transformers. ICCV. (2023)

  60. [60]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M, Darcet, T, Moutakanni, T, Vo, HV, Szafraniec, M, Khalidov, V, Fernandez, P, Haziza, D, Massa, F, El-Nouby, A, Howes, R, Huang, PY, Xu, H, Sharma, V, Li, SW, Galuba, W, Rabbat, M, Assran, M, Ballas, N, Synnaeve, G, Misra, I, Jegou, H, Mairal, J, Labatut, P, Joulin, A, and Bojanowski, P. DINOv2: Learning Robust Visual Features without Supervision....

  61. [61]

    Learning with 3D rotations, a hitchhiker’s guide to SO (3)

    Geist, AR, Frey, J, Zhobro, M, Levina, A, and Martius, G. Learning with 3D rotations, a hitchhiker’s guide to SO (3). arXiv:2404.11735 (2024)

  62. [62]

    On the continuity of rotation representations in neural networks

    Zhou, Y, Barnes, C, Lu, J, Yang, J, and Li, H. On the continuity of rotation representations in neural networks. CVPR. (2019)

  63. [63]

    O-cnn: Octree-based convolutional neural networks for 3d shape analysis

    Wang, PS, Liu, Y, Guo, YX, Sun, CY, and Tong, X. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM TOG (2017)

  64. [64]

    Flexible Isosurface Extraction for Gradient-Based Mesh Optimization

    Shen, T, Munkberg, J, Hasselgren, J, Yin, K, Wang, Z, Chen, W, Gojcic, Z, Fidler, S, Sharp, N, and Gao, J. Flexible Isosurface Extraction for Gradient-Based Mesh Optimization. ACM TOG (2023)

  65. [65]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Deitke, M, Liu, R, Wallingford, M, Ngo, H, Michel, O, Kusupati, A, Fan, A, Laforte, C, Voleti, V, Gadre, SY, VanderBilt, E, Kembhavi, A, Vondrick, C, Gkioxari, G, Ehsani, K, Schmidt, L, and Farhadi, A. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv:2307.05663 (2023)

  66. [66]

    ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

    Collins, J, Goel, S, Deng, K, Luthra, A, Xu, L, Gundogdu, E, Zhang, X, Yago Vicente, TF, Dideriksen, T, Arora, H, Guillaumin, M, and Malik, J. ABO: Dataset and Benchmarks for Real-World 3D Object Understanding. CVPR (2022)

  67. [67]

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Khanna*, M, Mao*, Y, Jiang, H, Haresh, S, Shacklett, B, Batra, D, Clegg, A, Undersander, E, Chang, AX, and Savva, M. Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation. arXiv (2023)

  68. [68]

    Cao, Z, Chen, Z, Pan, L, and Liu, Z. PhysX-3D: Physical-Grounded 3D Asset Generation. arXiv:2507.12465 (2025)

  69. [69]

    Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding

    Wang, P, He, Y, Lv, X, Zhou, Y, Xu, L, Yu, J, and Gu, J. Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding. arXiv:2510.20155 (2025)

  70. [70]

    SAPIEN: A SimulAted Part-based Interactive ENvironment

    Xiang, F, Qin, Y, Mo, K, Xia, Y, Zhu, H, Liu, F, Liu, M, Jiang, H, Yuan, Y, Wang, H, Yi, L, Chang, AX, Guibas, LJ, and Su, H. SAPIEN: A SimulAted Part-based Interactive ENvironment. CVPR. (2020)

  71. [71]

    Learning 6d object pose estimation using 3d object coordinates

    Brachmann, E, Krull, A, Michel, F, Gumhold, S, Shotton, J, and Rother, C. Learning 6d object pose estimation using 3d object coordinates. ECCV. (2014)

  72. [72]

    Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects

    Kaskman, R, Zakharov, S, Shugurov, I, and Ilic, S. Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. ICCVW. (2019)

  73. [73]

    6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark

    Tyree, S, Tremblay, J, To, T, Cheng, J, Mosier, T, Smith, J, and Birchfield, S. 6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark. IROS. (2022)

  74. [74]

    ZeroGrasp: Zero-shot shape reconstruction enabled robotic grasping

    Iwase, S, Irshad, MZ, Liu, K, Guizilini, V, Lee, R, Ikeda, T, Amma, A, Nishiwaki, K, Kitani, K, Ambrus, R, et al. ZeroGrasp: Zero-shot shape reconstruction enabled robotic grasping. CVPR. (2025)

  75. [75]

    Artvip: Articulated digital assets of visual realism, modular interaction, and physical fidelity for robot learning

    Jin, Z, Che, Z, Zhao, Z, Wu, K, Zhang, Y, Zhao, Y, Liu, Z, Zhang, Q, Ju, X, Tian, J, et al. Artvip: Articulated digital assets of visual realism, modular interaction, and physical fidelity for robot learning. arXiv:2506.04941 (2025)

  76. [76]

    BOP: Benchmark for 6D object pose estimation

    Hodan, T, Michel, F, Brachmann, E, Kehl, W, GlentBuch, A, Kraft, D, Drost, B, Vidal, J, Ihrke, S, Zabulis, X, et al. BOP: Benchmark for 6D object pose estimation. ECCV. (2018)

  77. [77]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (2024)

    Khazatsky, A, Pertsch, K, Nair, S, Balakrishna, A, Dasari, S, Karamcheti, S, Nasiriany, S, Srirama, MK, Chen, LY, Ellis, K, Fagan, PD, Hejna, J, Itkina, M, Lepert, M, Ma, YJ, Miller, PT, Wu, J, Belkhale, S, Dass, S, Ha, H, Jain, A, Lee, A, Lee, Y, Memmel, M, Park, S, Radosavovic, I, Wang, K, Zhan, A, Black, K, Chi, C, Hatch, KB, Lin, S, Lu, J, Mercat, J, ...

  78. [78]

    Native and Compact Structured Latents for 3D Generation

    Xiang, J, Chen, X, Xu, S, Wang, R, Lv, Z, Deng, Y, Zhu, H, Dong, Y, Zhao, H, Yuan, NJ, and Yang, J. Native and Compact Structured Latents for 3D Generation. Tech report (2025)

  79. [79]

    One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

    Geng, Z, Wang, N, Xu, S, Ye, C, Li, B, Chen, Z, Peng, S, and Zhao, H. One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation. arXiv:2509.07978 (2025)

  80. [80]

    Blenderproc

    Denninger, M, Sundermeyer, M, Winkelbauer, D, Zidan, Y, Olefir, D, Elbadrawy, M, Lodhi, A, and Katam, H. Blenderproc. arXiv:1911.01911 (2019)

Showing first 80 references.