pith. machine review for the scientific record.

arxiv: 2604.27106 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Recognition: unknown

Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:24 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords 3D scene reconstruction · multi-object scenes · generative models · RGB-D images · occlusion handling · pose estimation · synthetic priors · shape reconstruction

The pith

RecGen jointly estimates shapes, parts, and poses for multi-object 3D scenes from sparse RGB-D views by training generative models on compositional synthetic scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RecGen, a generative approach to full 3D scene reconstruction that jointly infers object shapes, part shapes, and object poses even when views are limited and objects block one another. It trains on synthetically assembled scenes to build shape knowledge that carries over to real photographs and varied environments. A reader would care because this reduces the need for enormous real-world 3D datasets while still delivering usable geometry and positioning for downstream tasks such as robotics simulation. The reported results show consistent gains on challenging occluded test cases over a prior method that used far more training meshes.

Core claim

RecGen is a generative framework for probabilistic joint estimation of object and part shapes, as well as their pose under occlusion and partial visibility from one or multiple RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. It achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture.

What carries the argument

Generative model trained on compositionally assembled synthetic scenes to produce transferable 3D shape and pose priors for joint probabilistic inference from sparse RGB-D input.
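
The paper's concrete pipeline is not reproduced in this summary, so the following is only a minimal sketch of what "compositionally assembled synthetic scenes" could mean in practice: sample object meshes from an asset library, drop them into a scene with random poses, and reject placements whose bounding boxes collide. The library contents, the planar-pose sampler, and the axis-aligned collision test are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch of compositional synthetic scene assembly (assumed, not the paper's pipeline).
    import numpy as np

    def random_yaw_pose(xy_range=1.0):
        """Sample a tabletop pose: a yaw rotation plus an (x, y) translation."""
        yaw = np.random.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(yaw), np.sin(yaw)
        pose = np.eye(4)
        pose[:2, :2] = [[c, -s], [s, c]]
        pose[:2, 3] = np.random.uniform(-xy_range, xy_range, size=2)
        return pose

    def aabb_overlaps(center_a, half_a, center_b, half_b):
        """Axis-aligned bounding-box overlap test: a crude stand-in for real collision checking."""
        return bool(np.all(np.abs(center_a - center_b) < (half_a + half_b)))

    def compose_scene(asset_half_extents, n_objects=5, max_tries=100):
        """Place up to n_objects assets with non-overlapping footprints; return (asset_id, pose) pairs."""
        placed = []  # (asset_id, pose, half_extent)
        ids = list(asset_half_extents)
        for _ in range(max_tries):
            if len(placed) == n_objects:
                break
            asset_id = ids[np.random.randint(len(ids))]
            pose = random_yaw_pose()
            half = asset_half_extents[asset_id]
            if all(not aabb_overlaps(pose[:3, 3], half, p[:3, 3], h) for _, p, h in placed):
                placed.append((asset_id, pose, half))
        return [(a, p) for a, p, _ in placed]

    # Toy asset library: object id -> half-extent of its bounding box (metres).
    library = {"mug": np.array([0.05, 0.05, 0.06]),
               "box": np.array([0.10, 0.08, 0.05]),
               "bowl": np.array([0.08, 0.08, 0.04])}
    print(compose_scene(library))

Each composed scene would then be rendered to RGB-D with ground-truth shapes, parts, and poses, which is what lets a generative model learn occlusion-robust priors without real 3D annotations.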

If this is right

  • The method produces usable estimates for object parts and symmetric items that prior techniques handled poorly under occlusion.
  • It reaches higher geometric accuracy, texture fidelity, and pose precision than SAM3D while requiring roughly 80 percent fewer training meshes.
  • Performance holds across single-view and multi-view inputs on heavily occluded real-world test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The data-efficiency result points to structured synthetic composition as a practical route for lowering the cost of building 3D perception systems for new environments.
  • Similar generative priors could be tested for extending reconstruction to dynamic or video sequences where temporal information further constrains the possible shapes and motions.
  • Robotics applications that need rapid scene models for planning would gain from the reported robustness to partial views and clutter.

Load-bearing premise

Shape priors acquired from synthetic scenes composed of known objects will transfer to real photographs that contain different lighting, textures, and object instances without a large performance penalty.

What would settle it

A clear performance collapse relative to baselines when the same model is evaluated on a fresh set of real multi-object scenes whose object categories or surface appearances were never used in the synthetic training compositions.
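
A concrete way to run that test, sketched under the assumption that every real evaluation scene is tagged with the object categories it contains (the scene records and category names below are hypothetical placeholders):

    # Sketch of the held-out evaluation proposed above: keep only real scenes whose
    # object categories never appeared in the synthetic training compositions.
    # Scene records and category names are hypothetical.
    synthetic_training_categories = {"mug", "box", "bowl", "chair"}

    real_scenes = [
        {"id": "scene_001", "categories": {"mug", "plant"}},
        {"id": "scene_002", "categories": {"shoe", "backpack"}},
        {"id": "scene_003", "categories": {"box", "bowl"}},
    ]

    # A scene counts as "fresh" only if none of its categories were used in training.
    fresh_scenes = [s for s in real_scenes
                    if s["categories"].isdisjoint(synthetic_training_categories)]

    print("held-out evaluation set:", [s["id"] for s in fresh_scenes])

Comparing RecGen against the baselines on this fresh subset alone, rather than on the full test set, is what would separate genuine transfer from memorized compositions.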

read the original abstract

Accurately reconstructing complex full multi-object scenes from sparse observations remains a core challenge in computer vision and a key step toward scalable and reliable simulation for robotics. In this work, we introduce RecGen, a generative framework for probabilistic joint estimation of object and part shapes, as well as their pose under occlusion and partial visibility from one or multiple RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. RecGen achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture. Despite using nearly 80% fewer training meshes than the previous state of the art SAM3D, RecGen outperforms it by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation.
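
For readers checking the headline numbers, the relative figures in the abstract are ratios of the usual form; the raw values below are hypothetical and serve only to show how a 30.1% improvement or an 80% mesh reduction would be computed.

    # How the abstract's relative figures are typically computed.
    # The raw values here are hypothetical; only the formulas matter.

    def relative_improvement(ours, baseline, higher_is_better=True):
        """Gain of `ours` over `baseline`, as a percentage of the baseline."""
        gain = (ours - baseline) if higher_is_better else (baseline - ours)
        return 100.0 * gain / baseline

    def relative_reduction(ours, baseline):
        """Percentage reduction of a resource (e.g. training-mesh count) versus the baseline."""
        return 100.0 * (baseline - ours) / baseline

    # A mesh budget one fifth the size of the baseline's is an 80% reduction:
    print(relative_reduction(ours=200_000, baseline=1_000_000))            # 80.0
    # A shape-quality score of 0.651 against a baseline of 0.500 is a 30.2% gain:
    print(round(relative_improvement(ours=0.651, baseline=0.500), 1))      # 30.2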

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RecGen, a generative framework for probabilistic joint estimation of object and part shapes as well as their poses from sparse RGB-D observations in multi-object scenes. It relies on compositional synthetic scene generation to learn strong 3D shape priors that are claimed to generalize to diverse real-world environments, handling severe occlusions, symmetry, and intricate geometry/texture. The central claim is state-of-the-art performance on complex, heavily occluded datasets, outperforming SAM3D by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation while using nearly 80% fewer training meshes.

Significance. If the generalization from synthetic compositional priors to real occluded scenes holds, the work would be significant for scalable robotics simulation by showing that generative 3D priors can deliver substantial gains with far less training data than prior methods. The reconstruction-by-generation paradigm for joint shape-pose inference under partial visibility is a promising direction, and the efficiency claim (80% fewer meshes) would be a notable contribution if supported by rigorous cross-domain validation.

major comments (2)
  1. [§5] §5 (Experiments): The headline performance gains on real-world heavily occluded datasets are presented without quantitative evidence that the synthetic training distribution closes the domain gap for real textures, lighting, and sensor noise. No real-vs-synthetic performance tables, domain-randomization ablations, or texture distribution statistics are reported, so it is unclear whether the 30.1% shape-quality improvement follows from the method or from unverified transfer assumptions.
  2. [§4] §4 (Method) and §5.1 (Ablations): The claim that strong shape priors learned from compositional synthetic scenes suffice for real-world generalization is load-bearing for the data-efficiency argument, yet the manuscript provides no controlled experiments isolating the contribution of the generative prior versus potential differences in baseline re-implementations or metric definitions.
minor comments (2)
  1. The abstract and introduction should explicitly list the exact real-world datasets used for testing and the precise training mesh count for both RecGen and SAM3D to allow direct verification of the 80% reduction claim.
  2. Figure captions and table footnotes could more clearly indicate whether reported metrics are computed on held-out synthetic scenes or on the real-world test sets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the reconstruction-by-generation paradigm. We address the major comments point by point below and outline planned revisions to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [§5] §5 (Experiments): The headline performance gains on real-world heavily occluded datasets are presented without quantitative evidence that the synthetic training distribution closes the domain gap for real textures, lighting, and sensor noise. No real-vs-synthetic performance tables, domain-randomization ablations, or texture distribution statistics are reported, so it is unclear whether the 30.1% shape-quality improvement follows from the method or from unverified transfer assumptions.

    Authors: We agree that explicit quantification of the domain gap would strengthen the presentation. The current results rely on direct evaluation on real datasets as implicit evidence of generalization from the compositional synthetic priors. In the revised manuscript we will add (i) a table comparing reconstruction metrics on held-out synthetic test scenes versus the real evaluation sets, (ii) domain-randomization ablations that vary texture, lighting, and noise parameters during training, and (iii) basic texture-distribution statistics between the synthetic corpus and the real test images. These additions will make the source of the reported gains more transparent. revision: yes

  2. Referee: [§4] §4 (Method) and §5.1 (Ablations): The claim that strong shape priors learned from compositional synthetic scenes suffice for real-world generalization is load-bearing for the data-efficiency argument, yet the manuscript provides no controlled experiments isolating the contribution of the generative prior versus potential differences in baseline re-implementations or metric definitions.

    Authors: We acknowledge the need for tighter isolation of the generative prior's contribution. Section 5.1 already contains ablations that disable the compositional generation and shape-prior components, showing measurable drops in performance. To address concerns about re-implementation details, the revision will (i) expand the description of our SAM3D re-implementation (including exact mesh counts, training schedules, and metric computation code), (ii) add a controlled experiment that trains RecGen without the generative prior while keeping all other architecture and optimization choices identical, and (iii) include a short appendix clarifying metric definitions. These changes will better separate the effect of the prior from other factors. revision: yes
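
Taken together, the two responses amount to a small experiment grid: the domain-randomization ranges promised in response 1 crossed with the with/without-prior toggle promised in response 2. A minimal sketch follows; every field name and range is an illustrative assumption, not a value from the paper or the rebuttal.

    # Sketch of the experiment grid implied by the two responses above:
    # domain-randomization ablations (response 1) crossed with a controlled
    # generative-prior toggle (response 2). Names and ranges are assumed.
    from dataclasses import dataclass
    from itertools import product

    @dataclass(frozen=True)
    class AblationConfig:
        use_generative_prior: bool   # response 2: only this differs within a matched pair
        texture_jitter: float        # response 1: texture randomization strength
        lighting_jitter: float       # response 1: lighting randomization strength
        depth_noise_std: float       # response 1: simulated sensor noise (metres)

    grid = [
        AblationConfig(prior, tex, light, noise)
        for prior, tex, light, noise in product(
            (True, False), (0.0, 0.5), (0.0, 0.5), (0.0, 0.01)
        )
    ]

    for i, cfg in enumerate(grid):
        # train_and_evaluate(cfg) would be the (hypothetical) entry point for each run;
        # comparing matched pairs that differ only in use_generative_prior isolates the prior.
        print(f"run {i:02d}: {cfg}")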

Circularity Check

0 steps flagged

No circularity detected in derivation or performance claims

full rationale

The manuscript introduces RecGen as a generative model leveraging compositional synthetic scene generation and 3D shape priors to achieve reported gains over the external baseline SAM3D. No equations, self-definitional relations, fitted-input predictions, or load-bearing self-citations are present that reduce the claimed shape/texture/pose metrics or generalization statements to quantities defined by construction within the paper itself. Performance numbers are framed as direct empirical comparisons against an independent prior method on held-out data, rendering the central claims self-contained rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that synthetic compositional scenes plus learned shape priors suffice to bridge the domain gap to real data; this is a domain assumption rather than a derived result. No explicit free parameters or invented entities are named in the abstract, but the generative model implicitly contains many tunable components typical of modern neural architectures.

free parameters (1)
  • shape prior strength
    The weighting or regularization strength of the 3D shape priors is almost certainly tuned during training to achieve the reported generalization; a generic sketch of how such a weight enters the objective follows this ledger.
axioms (1)
  • domain assumption: Compositional synthetic scene generation produces training distributions sufficiently close to real-world multi-object scenes for the learned priors to transfer.
    Invoked to justify training on synthetic data while claiming real-world generalization.
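
The "shape prior strength" flagged above is most naturally read as a weighting term in the training objective. The decomposition below and the values of lambda_prior are assumptions about how such priors are usually weighted, not the paper's loss.

    # Generic sketch of how a shape-prior weight enters a reconstruction objective.
    # The decomposition and the lambda_prior values are illustrative, not from the paper.

    def total_loss(recon_loss, pose_loss, prior_loss, lambda_prior=0.1):
        """Weighted sum; lambda_prior is the free parameter flagged in the ledger."""
        return recon_loss + pose_loss + lambda_prior * prior_loss

    # Sweeping lambda_prior is exactly the tuning the ledger points at:
    for lam in (0.0, 0.01, 0.1, 1.0):
        print(lam, total_loss(recon_loss=0.8, pose_loss=0.3, prior_loss=0.5, lambda_prior=lam))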

pith-pipeline@v0.9.0 · 5510 in / 1456 out tokens · 53482 ms · 2026-05-07T08:24:04.364715+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 21 canonical work pages · 6 internal anchors

  1. [1]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Li, C, Zhang, R, Wong, J, Gokmen, C, Srivastava, S, Martín-Martín, R, Wang, C, Levine, G, Lingelbach, M, Sun, J, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. CoRL. (2023)

  2. [2]

    Habitat: A platform for embodied ai research

    Savva, M, Kadian, A, Maksymets, O, Zhao, Y, Wijmans, E, Jain, B, Straub, J, Liu, J, Koltun, V, Malik, J, et al. Habitat: A platform for embodied ai research. ICCV. (2019)

  3. [3]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mittal, M, Roth, P, Tigue, J, Richard, A, Zhang, O, Du, P, Serrano-Munoz, A, Yao, X, Zurbrügg, R, Rudin, N, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv:2511.04831 (2025)

  4. [4]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T, Chen, Z, Chen, B, Cai, Z, Liu, Y, Li, Z, Liang, Q, Lin, X, Ge, Y, Gu, Z, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv:2506.18088 (2025)

  5. [5]

    Advancements and challenges of digital twins in industry

    Tao, F, Zhang, H, and Zhang, C. Advancements and challenges of digital twins in industry. Nature Computational Science (2024)

  6. [6]

    Living scenes: Multi-object relocalization and reconstruction in changing 3d environments

    Zhu, L, Huang, S, Schindler, K, and Armeni, I. Living scenes: Multi-object relocalization and reconstruction in changing 3d environments. CVPR. (2024)

  7. [7]

    SAM 3D: 3Dfy Anything in Images

    Chen, X, Chu, FJ, Gleize, P, Liang, KJ, Sax, A, Tang, H, Wang, W, Guo, M, Hardin, T, Li, X, et al. Sam 3d: 3dfy anything in images. arXiv:2511.16624 (2025)

  8. [8]

    Diffusionnocs: Managing symmetry and uncertainty in sim2real multi-modal category-level pose estimation

    Ikeda, T, Zakharov, S, Ko, T, Irshad, MZ, Lee, R, Liu, K, Ambrus, R, and Nishiwaki, K. Diffusionnocs: Managing symmetry and uncertainty in sim2real multi-modal category-level pose estimation. IROS. (2024)

  9. [9]

    Zero-1-to-3: Zero-shot one image to 3d object

    Liu, R, Wu, R, Van Hoorick, B, Tokmakov, P, Zakharov, S, and Vondrick, C. Zero-1-to-3: Zero-shot one image to 3d object. ICCV. (2023)

  10. [10]

    Structured 3d latents for scalable and versatile 3d generation

    Xiang, J, Lv, Z, Xu, S, Deng, Y, Wang, R, Zhang, B, Chen, D, Tong, X, and Yang, J. Structured 3d latents for scalable and versatile 3d generation. CVPR. (2025)

  11. [11]

    Team, TH. Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation. (2024)

  12. [12]

    Any6D: Model-free 6D Pose Estimation of Novel Objects

    Lee, T, Wen, B, Kang, M, Kang, G, Kweon, IS, and Yoon, KJ. Any6D: Model-free 6D Pose Estimation of Novel Objects. CVPR. (2025)

  13. [13]

    Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images

    Liu, Y, Wen, Y, Peng, S, Lin, C, Long, X, Komura, T, and Wang, W. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. ECCV. (2022)

  14. [14]

    SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation

    Agarwal, A, Singh, G, Sen, B, Lozano-Pérez, T, and Kaelbling, LP. SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation. arXiv:2410.23643 (2024)

  15. [15]

    Foundationpose: Unified 6d pose estimation and tracking of novel objects

    Wen, B, Yang, W, Kautz, J, and Birchfield, S. Foundationpose: Unified 6d pose estimation and tracking of novel objects. CVPR. (2024)

  16. [16]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Xu, J, Cheng, W, Gao, Y, Wang, X, Gao, S, and Shan, Y. InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. arXiv:2404.07191 (2024)

  17. [17]

    Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects

    Yu, Q, Yuan, X, Jiang, Y, Chen, J, Zheng, D, Hao, C, You, Y, Chen, Y, Mu, Y, Liu, L, et al. Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects. IROS. (2025)

  18. [18]

    DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation

    Jiang, T, Guan, Y, Ma, L, Xu, J, Meng, J, Chen, W, Zeng, Z, Li, L, Wu, D, and Chen, R. DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation. IEEE Transactions on Robotics (2025)

  19. [19]

    Foundationstereo: Zero-shot stereo matching

    Wen, B, Trepte, M, Aribido, J, Kautz, J, Gallo, O, and Birchfield, S. Foundationstereo: Zero-shot stereo matching. CVPR. (2025)

  20. [20]

    Wang, Z, Wang, Y, Chen, Y, Xiang, C, Chen, S, Yu, D, Li, C, Su, H, and Zhu, J. CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model. (2024)

  21. [21]

    Tang, J, Chen, Z, Chen, X, Wang, T, Zeng, G, and Liu, Z. LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. (2024)

  22. [22]

    Team, TH. Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. (2025)

  23. [23]

    Gigapose: Fast and robust novel object pose estimation via one correspondence

    Nguyen, VN, Groueix, T, Salzmann, M, and Lepetit, V. Gigapose: Fast and robust novel object pose estimation via one correspondence. CVPR. (2024)

  24. [24]

    Pos3R: 6D Pose Estimation for Unseen Objects Made Easy

    Deng, W, Campbell, D, Sun, C, Zhang, J, Kanitkar, S, Shaffer, ME, and Gould, S. Pos3R: 6D Pose Estimation for Unseen Objects Made Easy. CVPR. (2025)

  25. [25]

    Liu, K, Zakharov, S, Chen, D, Ikeda, T, Shakhnarovich, G, Gaidon, A, and Ambrus, R. OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World. (2025)

  26. [26]

    Structure-from-motion revisited

    Schonberger, JL and Frahm, JM. Structure-from-motion revisited. CVPR. (2016)

  27. [27]

    Ardelean, A, Özer, M, and Egger, B. Gen3DSR: Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View. (2025)

  28. [28]

    CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation

    Irshad, MZ, Kollar, T, Laskey, M, Stone, K, and Kira, Z. CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation. ICRA. (2022)

  29. [29]

    ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization

    Irshad, MZ, Zakharov, S, Ambrus, R, Kollar, T, Kira, Z, and Gaidon, A. ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization. ECCV. (2022)

  30. [30]

    MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

    Li, Y, Zhang, J, Chen, Z, Wang, Z, and Liu, Z. MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation. CVPR. (2025)

  31. [31]

    PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

    Chen, M, Shapovalov, R, Laina, I, Monnier, T, Wang, J, Novotny, D, and Vedaldi, A. PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models. CVPR. (2025)

  32. [32]

    UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents

    He, X, Wu, Y, Guo, X, Ye, C, Zhou, J, Hu, T, Han, X, and Du, D. UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents. arXiv:2512.09435 (2026)

  33. [33]

    BANG: Dividing 3D Assets via Generative Exploded Dynamics

    Zhang, L, Zhang, Q, Jiang, H, Bai, Y, Yang, W, Xu, L, and Yu, J. BANG: Dividing 3D Assets via Generative Exploded Dynamics. ACM TOG (2025)

  34. [34]

    Lin, Y, Lin, C, Pan, P, Yan, H, Feng, Y, Mu, Y, and Fragkiadaki, K. PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers. (2025)

  35. [35]

    Melnik, A, Alt, B, Nguyen, G, Wilkowski, A, Stefańczyk, M, Wu, Q, Harms, S, Rhodin, H, Savva, M, and Beetz, M. Digital Twin Generation from Visual Data: A Survey. (2026)

  36. [36]

    Irshad, MZ, Comi, M, Lin, YC, Heppert, N, Valada, A, Ambrus, R, Kira, Z, and Tremblay, J. Neural Fields in Robotics: A Survey. (2024)

  37. [37]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering

    Kerbl, B, Kopanas, G, Leimkühler, T, and Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM TOG (2023)

  38. [38]

    Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects

    Yu, J, Hari, K, El-Refai, K, Dalil, A, Kerr, J, Kim, CM, Cheng, R, Irshad, MZ, and Goldberg, K. Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects. ICRA (2025)

  39. [39]

    Qureshi, MN, Garg, S, Yandun, F, Held, D, Kantor, G, and Silwal, A. SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting. (2024)

  40. [40]

    Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

    Shorinwa, O, Tucker, J, Smith, A, Swann, A, Chen, T, Firoozi, R, Kennedy, MD, and Schwager, M. Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting. (2024)

  41. [41]

    Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics

    Abou-Chakra, J, Rana, K, Dayoub, F, and Sünderhauf, N. Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics. (2024)

  42. [42]

    GraspSplats: Efficient Manipulation with 3D Feature Splatting

    Ji, M, Qiu, RZ, Zou, X, and Wang, X. GraspSplats: Efficient Manipulation with 3D Feature Splatting. arXiv:2409.02084 (2024)

  43. [43]

    Chhablani, G, Ye, X, Irshad, MZ, and Kira, Z. EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device. (2025)

  44. [44]

    Escontrela, A, Kerr, J, Allshire, A, Frey, J, Duan, R, Sferrazza, C, and Abbeel, P. GaussGym: An open-source real-to-sim framework for learning locomotion from pixels. (2025)

  45. [45]

    Distilled feature fields enable few-shot language-guided manipulation

    Shen, W, Yang, G, Yu, A, Wong, J, Kaelbling, LP, and Isola, P. Distilled feature fields enable few-shot language-guided manipulation. arXiv:2308.07931 (2023)

  46. [46]

    Yang, S, Yu, W, Zeng, J, Lv, J, Ren, K, Lu, C, Lin, D, and Pang, J. Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation. (2025)

  47. [47]

    GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    Jiang, G, Chang, H, Qiu, RZ, Liang, Y, Ji, M, Zhu, J, Dong, Z, Zou, X, and Wang, X. GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation. arXiv:2510.20813 (2025)

  48. [48]

    X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

    Dan, P, Kedia, K, Chao, A, Duan, E, Pace, MA, Ma, WC, and Choudhury, S. X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real. CoRL. (2025)

  49. [49]

    Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination

    Barcellona, L, Zadaianchuk, A, Allegro, D, Papa, S, Ghidoni, S, and Gavves, E. Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. ICLR. (2025)

  50. [50]

    Yu, J, Fu, L, Huang, H, El-Refai, K, Ambrus, RA, Cheng, R, Irshad, MZ, and Goldberg, K. Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware. (2025)

  51. [51]

    ZeroBot: Learning From Scratch in Minutes With Generative Real2Sim

    Kapelyukh, I, Zhang, X, James, S, Herlant, L, and Johns, E. ZeroBot: Learning From Scratch in Minutes With Generative Real2Sim. RA-L (2026)

  52. [52]

    Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions

    Zhang, K, Sha, S, Jiang, H, Loper, M, Song, H, Cai, G, Xu, Z, Hu, X, Zheng, C, and Li, Y. Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions. ICRA. (2026)

  53. [53]

    Jangir, Y, Zhang, Y, Lo, PC, Yamazaki, K, Zhang, C, Tu, KH, Ke, TW, Ke, L, Bisk, Y, and Fragkiadaki, K. RobotArena∞: Scalable Robot Benchmarking via Real-to-Sim Translation. (2025)

  54. [54]

    Jain, A, Zhang, M, Arora, K, Chen, W, Torne, M, Irshad, MZ, Zakharov, S, Wang, Y, Levine, S, Finn, C, Ma, WC, Shah, D, Gupta, A, and Pertsch, K. PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies. (2025)

  55. [55]

    Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

    Yu, X, Talak, R, Shaikewitz, L, and Carlone, L. Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling. arXiv:2602.08058 (2026)

  56. [56]

    Xiang, T, Cao, J, Guo, S, Zhao, G, Luo, AF, and Ma, J. Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning. (2026)

  57. [57]

    Huang, WC, Han, J, Ye, X, Pan, Z, and Hauser, K. Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization. (2026)

  58. [58]

    Flow Matching for Generative Modeling

    Lipman, Y, Chen, RT, Ben-Hamu, H, Nickel, M, and Le, M. Flow Matching for Generative Modeling. ICLR. (2023)

  59. [59]

    Scalable diffusion models with transformers

    Peebles, W and Xie, S. Scalable diffusion models with transformers. ICCV. (2023)

  60. [60]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M, Darcet, T, Moutakanni, T, Vo, HV, Szafraniec, M, Khalidov, V, Fernandez, P, Haziza, D, Massa, F, El-Nouby, A, Howes, R, Huang, PY, Xu, H, Sharma, V, Li, SW, Galuba, W, Rabbat, M, Assran, M, Ballas, N, Synnaeve, G, Misra, I, Jegou, H, Mairal, J, Labatut, P, Joulin, A, and Bojanowski, P. DINOv2: Learning Robust Visual Features without Supervision....

  61. [61]

    Learning with 3D rotations, a hitchhiker’s guide to SO (3)

    Geist, AR, Frey, J, Zhobro, M, Levina, A, and Martius, G. Learning with 3D rotations, a hitchhiker’s guide to SO (3). arXiv:2404.11735 (2024)

  62. [62]

    On the continuity of rotation representations in neural networks

    Zhou, Y, Barnes, C, Lu, J, Yang, J, and Li, H. On the continuity of rotation representations in neural networks. CVPR. (2019)

  63. [63]

    O-cnn: Octree-based convolutional neural networks for 3d shape analysis

    Wang, PS, Liu, Y, Guo, YX, Sun, CY, and Tong, X. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM TOG (2017)

  64. [64]

    Flexible Isosurface Extraction for Gradient-Based Mesh Optimization

    Shen, T, Munkberg, J, Hasselgren, J, Yin, K, Wang, Z, Chen, W, Gojcic, Z, Fidler, S, Sharp, N, and Gao, J. Flexible Isosurface Extraction for Gradient-Based Mesh Optimization. ACM TOG (2023)

  65. [65]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Deitke, M, Liu, R, Wallingford, M, Ngo, H, Michel, O, Kusupati, A, Fan, A, Laforte, C, Voleti, V, Gadre, SY, VanderBilt, E, Kembhavi, A, Vondrick, C, Gkioxari, G, Ehsani, K, Schmidt, L, and Farhadi, A. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv:2307.05663 (2023)

  66. [66]

    ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

    Collins, J, Goel, S, Deng, K, Luthra, A, Xu, L, Gundogdu, E, Zhang, X, Yago Vicente, TF, Dideriksen, T, Arora, H, Guillaumin, M, and Malik, J. ABO: Dataset and Benchmarks for Real-World 3D Object Understanding. CVPR (2022)

  67. [67]

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Khanna*, M, Mao*, Y, Jiang, H, Haresh, S, Shacklett, B, Batra, D, Clegg, A, Undersander, E, Chang, AX, and Savva, M. Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation. arXiv (2023)

  68. [68]

    Cao, Z, Chen, Z, Pan, L, and Liu, Z. PhysX-3D: Physical-Grounded 3D Asset Generation. arXiv:2507.12465 (2025)

  69. [69]

    Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding

    Wang, P, He, Y, Lv, X, Zhou, Y, Xu, L, Yu, J, and Gu, J. Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding. arXiv:2510.20155 (2025)

  70. [70]

    SAPIEN: A SimulAted Part-based Interactive ENvironment

    Xiang, F, Qin, Y, Mo, K, Xia, Y, Zhu, H, Liu, F, Liu, M, Jiang, H, Yuan, Y, Wang, H, Yi, L, Chang, AX, Guibas, LJ, and Su, H. SAPIEN: A SimulAted Part-based Interactive ENvironment. CVPR. (2020)

  71. [71]

    Learning 6d object pose estimation using 3d object coordinates

    Brachmann, E, Krull, A, Michel, F, Gumhold, S, Shotton, J, and Rother, C. Learning 6d object pose estimation using 3d object coordinates. ECCV. (2014)

  72. [72]

    Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects

    Kaskman, R, Zakharov, S, Shugurov, I, and Ilic, S. Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. ICCVW. (2019)

  73. [73]

    6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark

    Tyree, S, Tremblay, J, To, T, Cheng, J, Mosier, T, Smith, J, and Birchfield, S. 6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark. IROS. (2022)

  74. [74]

    ZeroGrasp: Zero-shot shape reconstruction enabled robotic grasping

    Iwase, S, Irshad, MZ, Liu, K, Guizilini, V, Lee, R, Ikeda, T, Amma, A, Nishiwaki, K, Kitani, K, Ambrus, R, et al. ZeroGrasp: Zero-shot shape reconstruction enabled robotic grasping. CVPR. (2025)

  75. [75]

    Artvip: Articulated digital assets of visual realism, modular interaction, and physical fidelity for robot learning

    Jin, Z, Che, Z, Zhao, Z, Wu, K, Zhang, Y, Zhao, Y, Liu, Z, Zhang, Q, Ju, X, Tian, J, et al. Artvip: Articulated digital assets of visual realism, modular interaction, and physical fidelity for robot learning. arXiv:2506.04941 (2025)

  76. [76]

    BOP: Benchmark for 6D object pose estimation

    Hodan, T, Michel, F, Brachmann, E, Kehl, W, GlentBuch, A, Kraft, D, Drost, B, Vidal, J, Ihrke, S, Zabulis, X, et al. BOP: Benchmark for 6D object pose estimation. ECCV. (2018)

  77. [77]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (2024)

    Khazatsky, A, Pertsch, K, Nair, S, Balakrishna, A, Dasari, S, Karamcheti, S, Nasiriany, S, Srirama, MK, Chen, LY, Ellis, K, Fagan, PD, Hejna, J, Itkina, M, Lepert, M, Ma, YJ, Miller, PT, Wu, J, Belkhale, S, Dass, S, Ha, H, Jain, A, Lee, A, Lee, Y, Memmel, M, Park, S, Radosavovic, I, Wang, K, Zhan, A, Black, K, Chi, C, Hatch, KB, Lin, S, Lu, J, Mercat, J, ...

  78. [78]

    Native and Compact Structured Latents for 3D Generation

    Xiang, J, Chen, X, Xu, S, Wang, R, Lv, Z, Deng, Y, Zhu, H, Dong, Y, Zhao, H, Yuan, NJ, and Yang, J. Native and Compact Structured Latents for 3D Generation. Tech report (2025)

  79. [79]

    One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

    Geng, Z, Wang, N, Xu, S, Ye, C, Li, B, Chen, Z, Peng, S, and Zhao, H. One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation. arXiv:2509.07978 (2025)

  80. [80]

    Blenderproc

    Denninger, M, Sundermeyer, M, Winkelbauer, D, Zidan, Y, Olefir, D, Elbadrawy, M, Lodhi, A, and Katam, H. Blenderproc. arXiv:1911.01911 (2019)

Showing first 80 references.