
arxiv: 2606.17520 · v1 · pith:NGZQ6GTQ · submitted 2026-06-16 · 💻 cs.RO · cs.CV

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

Pith reviewed 2026-06-27 00:54 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords gaussian splatting · scene reconstruction · sim-to-real · embodied agents · robot learning · object extraction · inpainting · simulation environments

The pith

GASE automates the reconstruction of high-fidelity simulation environments from panoramic video for robot learning, with a reported sim-to-real performance gap under 10 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GASE as an automated system that scans environments using multi-view video from panoramic camera arrays and applies Gaussian splatting for reconstruction. It introduces a camera-pose-based strategy to extract objects in the 2D domain and performs high-fidelity inpainting to separate foreground and background. This pipeline allows independent reconstruction of assets that can be imported into physics simulators. The approach aims to enable large-scale training of embodied agents with reduced sim-to-real gap and lower costs than real-world data collection.

Core claim

GASE uses multi-view video streams from panoramic camera arrays for rapid environment scanning. A camera-pose-based strategy extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are reconstructed independently with Gaussian splatting and seamlessly imported into physics simulators for policy training. Experiments show it outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% and achieves state-of-the-art inpainting quality. Real-robot deployments in manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained on real-world data.

What carries the argument

The camera-pose-based strategy for robust object extraction across frames in the 2D domain followed by high-fidelity scene inpainting to enable independent reconstruction of foreground objects and static background.

If this is right

  • Outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10%.
  • Achieves state-of-the-art inpainting quality.
  • Maintains a performance gap of less than 10% in real-robot manipulation and navigation tasks compared to real-world trained policies.
  • Enables efficient import of reconstructed assets into physics simulators for embodied agent training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This system could support scaling up training data for robot policies without corresponding increases in real-world collection efforts.
  • The reconstruction technique might be adapted for other simulation domains beyond robotics if the extraction method generalizes.
  • Further validation on longer sequences or more cluttered scenes would test the robustness of the 2D extraction step.

Load-bearing premise

The camera-pose-based strategy robustly extracts objects across frames in the 2D domain to enable high-fidelity independent reconstruction of foreground objects and static background.

What would settle it

Observing a performance gap exceeding 10% between GASE-trained policies and real-world trained policies on the reported manipulation and navigation tasks.
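How the gap is computed is not specified in the material summarized here. As a minimal sketch, assuming the gap is derived from task success rates measured on the real robot, the hypothetical helper below shows the two most common readings of a "10%" threshold: an absolute difference in percentage points and a difference relative to the real-data baseline. The function name and the example numbers are illustrative and are not taken from the paper.

```python
def sim_to_real_gap(real_success: float, sim_success: float) -> dict:
    """Compare two success rates in [0, 1], both measured on the real robot.

    real_success: policy trained on real-world data (the baseline).
    sim_success:  policy trained in the reconstructed simulation.
    """
    for value in (real_success, sim_success):
        if not 0.0 <= value <= 1.0:
            raise ValueError("success rates must lie in [0, 1]")
    if real_success == 0.0:
        raise ValueError("relative gap is undefined for a zero baseline")

    absolute = real_success - sim_success   # fraction of trials
    relative = absolute / real_success      # fraction of the baseline
    return {
        "absolute_points": absolute * 100.0,   # percentage points
        "relative_percent": relative * 100.0,  # percent of baseline
    }


# Illustrative numbers only (assumed, not reported results).
gap = sim_to_real_gap(real_success=0.80, sim_success=0.73)
near_threshold = sim_to_real_gap(real_success=0.50, sim_success=0.44)
```

The two readings can disagree near the threshold: in the second illustrative call the gap is 6 percentage points in absolute terms but 12 percent relative to the baseline, so whether an observation "exceeds 10%" depends on which definition the paper uses.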

Original abstract

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes GASE, an automated pipeline for constructing high-fidelity simulation environments from multi-view panoramic video using 3D Gaussian splatting. A camera-pose-based 2D extraction step followed by scene inpainting separates foreground objects from static background, enabling independent reconstruction and direct import into physics simulators. The central empirical claims are >10% gains in segmentation accuracy over prior 3D Gaussian methods, state-of-the-art inpainting quality, and real-robot manipulation/navigation policies whose performance remains within 10% of policies trained exclusively on real data.

Significance. If the reported performance numbers are supported by properly documented experiments, the work would be significant for embodied AI: it directly targets the data-acquisition bottleneck in sim-to-real transfer by offering a largely automated, high-visual-fidelity reconstruction workflow. The explicit promise to release code is a positive factor for reproducibility.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (Experiments): the headline claims of '>10% segmentation accuracy' and '<10% sim-to-real performance gap' are presented without any description of dataset sizes, number of scenes, exact metrics, baseline implementations, number of trials, or statistical tests. Because these numbers constitute the primary evidence for the central claim that the pipeline bridges the sim-to-real gap, the absence of protocol details is load-bearing.
  2. [§3.2] §3.2 (Object Extraction and Inpainting): the entire performance narrative rests on the assertion that the camera-pose-based 2D extraction 'robustly extracts objects across frames' followed by high-fidelity inpainting. No quantitative ablation (e.g., extraction IoU, failure rates under occlusion or blur) or removal of the inpainting stage is reported, leaving the causal link between the proposed strategy and the claimed downstream gains unverified.
minor comments (1)
  1. [Abstract] The abstract states 'Code will be released' but provides neither a repository URL nor a commit hash; this should be added for a camera-ready version.
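The extraction IoU and failure-rate ablation requested in major comment 2 could be computed along the following lines. This is a minimal sketch, assuming binary per-frame masks stored as nested lists of 0/1 values and assuming that a frame counts as a failure when its IoU falls below 0.5. The helper names, the mask representation, and the threshold are all hypothetical and do not come from the paper.

```python
from typing import Iterable, Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel labels (assumed format)


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of identical shape."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


def failure_rate(ious: Iterable[float], threshold: float = 0.5) -> float:
    """Fraction of frames whose IoU is below the (assumed) threshold."""
    values = list(ious)
    if not values:
        raise ValueError("at least one IoU value is required")
    return sum(1 for v in values if v < threshold) / len(values)


# Toy 2x3 masks for illustration only.
pred = [[0, 1, 1], [0, 1, 0]]
truth = [[0, 1, 0], [0, 1, 1]]
iou = mask_iou(pred, truth)
rate = failure_rate([0.9, 0.4, 0.6, 0.2])
```

Reporting the per-frame IoU distribution separately for occluded and blurred frames, rather than a single average, would show whether the extraction step degrades gracefully or fails abruptly.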

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. The feedback correctly identifies areas where additional experimental documentation and ablations would strengthen the presentation of our results. We address each major comment below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (Experiments): the headline claims of '>10% segmentation accuracy' and '<10% sim-to-real performance gap' are presented without any description of dataset sizes, number of scenes, exact metrics, baseline implementations, number of trials, or statistical tests. Because these numbers constitute the primary evidence for the central claim that the pipeline bridges the sim-to-real gap, the absence of protocol details is load-bearing.

    Authors: We agree that the current presentation of headline claims in the abstract and §4 would benefit from more explicit protocol documentation. Although §4 describes the datasets, metrics, and tasks at a high level, it does not enumerate scene counts, trial numbers, or include statistical tests. In the revised manuscript we will expand §4 with a new 'Experimental Protocol' subsection that reports: number of panoramic video sequences and distinct scenes, exact metric definitions and implementations, baseline code references or re-implementations, number of policy training/evaluation trials per task, and results of appropriate statistical tests (e.g., paired t-tests with p-values). revision: yes

  2. Referee: [§3.2] §3.2 (Object Extraction and Inpainting): the entire performance narrative rests on the assertion that the camera-pose-based 2D extraction 'robustly extracts objects across frames' followed by high-fidelity inpainting. No quantitative ablation (e.g., extraction IoU, failure rates under occlusion or blur) or removal of the inpainting stage is reported, leaving the causal link between the proposed strategy and the claimed downstream gains unverified.

    Authors: The referee is correct that §3.2 currently provides only a qualitative description of the extraction and inpainting pipeline without supporting quantitative ablations. To establish the contribution of these components, the revised manuscript will add quantitative results: per-frame extraction IoU and failure rates under controlled occlusion/blur conditions, plus an ablation that removes the inpainting stage and measures the resulting impact on both segmentation accuracy and downstream policy performance. These new experiments will be reported in an expanded §4. revision: yes
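The paired t-test mentioned in the first response can be illustrated as follows. This is a minimal sketch, assuming per-task success rates for the real-trained and simulation-trained policies are available as two equal-length lists paired by task. The function name and the numbers are hypothetical. Only the t statistic and degrees of freedom are computed here, because the Python standard library has no Student t distribution; a p-value would come from a t table or from a statistics package such as SciPy (`scipy.stats.ttest_rel`).

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    if len(a) < 2:
        raise ValueError("at least two pairs are required")

    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("t statistic is undefined when all differences are equal")

    standard_error = sd_diff / math.sqrt(len(diffs))
    return mean_diff / standard_error, len(diffs) - 1


# Illustrative per-task success rates (assumed, not reported results).
real_trained = [0.80, 0.75, 0.90, 0.85]
sim_trained = [0.75, 0.70, 0.80, 0.80]
t_stat, dof = paired_t_statistic(real_trained, sim_trained)
```

With only a handful of tasks the test has little power, so the number of trials per task that the authors promise to report matters as much as the p-value itself.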

Circularity Check

0 steps flagged

No circularity: empirical system paper with no derivations or fitted predictions

Full rationale

The manuscript presents an automated reconstruction pipeline (multi-view video, camera-pose 2D extraction, inpainting, independent foreground/background Gaussian reconstruction) and supports its claims solely via reported experimental metrics (segmentation accuracy, inpainting quality, real-robot policy transfer gaps). No equations, parameter-fitting steps, self-citations used as uniqueness theorems, or renamings of known results appear in the provided text. The central claims rest on external empirical benchmarks rather than any internal reduction to fitted inputs or self-referential definitions, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the proposed extraction and inpainting steps and on the assumption that Gaussian splatting produces assets suitable for physics simulation; no free parameters, ad-hoc axioms, or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: 3D Gaussian splatting can produce high-fidelity reconstructions from multi-view images that are suitable for downstream simulation.
    Invoked implicitly when stating that foreground and background are reconstructed independently and imported into physics simulators.



Reference graph

Works this paper leans on

90 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin, 2025

    Jad Abou-Chakra, Lingfeng Sun, Krishan Rana, Brandon May, Karl Schmeckpeper, Niko Suenderhauf, Maria Vittoria Minniti, and Laura Herlant. Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin, 2025. URL https: //arxiv.org/abs/2504.03597

  2. [2]

    Piper robotic arm.https://www.agibot.com/, 2025

    AgiBot. Piper robotic arm.https://www.agibot.com/, 2025. Accessed: 2026-05-09

  3. [3]

    Barron, Ben Mildenhall, Dor Verbin, Pratul P

    Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields.CVPR, 2022

  4. [4]

    URLhttps://arxiv.org/abs/2410.24164

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...

  5. [5]

    Rt-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, NikhilJ Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, D...

  6. [6]

    Physx-3d: Physical-grounded 3d asset generation.arXiv preprint arXiv:2507.12465, 2025

    Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-3d: Physical-grounded 3d asset generation.arXiv preprint arXiv:2507.12465, 2025

  7. [7]

    Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

    Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

  8. [8]

    Sam 3: Segment anything with concepts, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Va- sudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Lilian...

  9. [9]

    Segment any 3d gaussians

    Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. Dec 2023

  10. [10]

    A survey on 3d gaussian splatting, 2025

    Guikun Chen and Wenguan Wang. A survey on 3d gaussian splatting, 2025. URL https://arxiv.org/abs/2401. 03890

  11. [11]

    Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

    Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. Apr 2023

  12. [12]

    Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis, 2025

    Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis, 2025. URL https://arxiv.org/abs/2503. 13265

  13. [13]

    Meshanything: Artist-created mesh generation with autoregressive transformers, 2024

    Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, Guosheng Lin, and Chi Zhang. Meshanything: Artist-created mesh generation with autoregressive transformers, 2024

  14. [14]

    Tracking anything with decoupled video segmentation

    Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. InICCV, 2023

  15. [15]

    Embodiedsplat: Personalized real-to-sim-to-real navigation with gaussian splats from a mobile device, 2025

    Gunjan Chhablani, Xiaomeng Ye, Muhammad Zubair Irshad, and Zsolt Kira. Embodiedsplat: Personalized real-to-sim-to-real navigation with gaussian splats from a mobile device, 2025. URLhttps://arxiv.org/abs/2509.17430

  16. [16]

    X-sim: Cross-embodiment learning via real-to-sim-to-real

    Prithwish Dan, Kushal Kedia, Angela Chao, Edward Weiyi Duan, Maximus Adrian Pace, Wei-Chiu Ma, and Sanjiban Choudhury. X-sim: Cross-embodiment learning via real-to-sim-to-real. 2025. URLhttps://arxiv.org/abs/2505.07096

  17. [17]

    Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation

    Hongwei Fan, Hang Dai, Jiyao Zhang, Jinzhou Li, Qiyang Yan, Yujie Zhao, Mingju Gao, Jinghang Wu, Hao Tang, and Hao Dong. Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation. 2025. URL https://arxiv.org/abs/2512.19390. 9

  18. [18]

    Zhao, and Chelsea Finn

    Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), 2024

  19. [19]

    Theory of the motion of the heavenly bodies moving about the sun in conic sections.Gauss’s Theoria Motus, 76(1):5–23, 1857

    Carl Friedrich Gauss and Charles Henry Davis. Theory of the motion of the heavenly bodies moving about the sun in conic sections.Gauss’s Theoria Motus, 76(1):5–23, 1857

  20. [20]

    Re3sim: Generating high-fidelity simulation data via 3d-photorealistic real-to-sim for robotic manipulation

    Xiaoshen Han, Junqiu Yu, Minghuan Liu, Yilun Chen, Xiaoyang Lyu, Yang Tian, Bolun Wang, Weinan Zhang, Weinan Zhang, and Jiangmiao Pang. Re3sim: Generating high-fidelity simulation data via 3d-photorealistic real-to-sim for robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), 2026

  21. [21]

    Gvgen: Text-to-3d generation with volumetric representation

    Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, Pieter Abbeel, and UC Berkeley. Denoising diffusion probabilistic models

  23. [23]

    In: Burbano, A., Zorin, D., Jarosz, W

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InSIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024. doi: 10.1145/3641519. 3657428

  24. [24]

    3d gaussian inpainting with depth-guided cross-view consistency

    Sheng-Yu Huang, Zi-Ting Chou, and Yu-Chiang Frank Wang. 3d gaussian inpainting with depth-guided cross-view consistency. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26704–26713, 2025

  25. [25]

    Neural wavelet-domain diffusion for 3d shape generation

    Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3d shape generation. Sep 2022

  26. [26]

    Insta360 x5.https://www.insta360.com/, 2025

    Insta360. Insta360 x5.https://www.insta360.com/, 2025. Accessed: 2026-05-09

  27. [27]

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szym...

  28. [28]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  29. [29]

    Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025

    Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, and Karl Pertsch. Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025. URLhttps://arxiv.org/abs/2512.16881

  30. [30]

    Postshot.https://www.jawset.com/, 2025

    Jawset Visual Computing. Postshot.https://www.jawset.com/, 2025. Accessed: 2026-05-09

  31. [31]

    Fastlgs: Speeding up language embedded gaussians with feature grid mapping

    Yuzhou Ji, He Zhu, Junshu Tang, Wuyi Liu, Zhizhong Zhang, Xin Tan, and Yuan Xie. Fastlgs: Speeding up language embedded gaussians with feature grid mapping. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

  32. [32]

    Planing: A loosely coupled triangle-gaussian framework for streaming 3d reconstruction, 2026

    Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, and Mulin Yu. Planing: A loosely coupled triangle-gaussian framework for streaming 3d reconstruction, 2026. URL https://arxiv. org/abs/2601.22046

  33. [33]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025

  34. [34]

    Poisson surface reconstruction

    M Kazhdan. Poisson surface reconstruction. 2006

  35. [35]

    Screened poisson surface reconstruction.Acm Transactions on Graphics, 32(3):1–13, 2013

    Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction.Acm Transactions on Graphics, 32(3):1–13, 2013. 10

  36. [36]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/ 3d-gaussian-splatting/

  37. [37]

    Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything.arXiv:2304.02643, 2023

  38. [38]

    Feature refinement to improve high resolution image inpainting, 2022

    Prakhar Kulshreshtha, Brian Pugh, and Salma Jiddi. Feature refinement to improve high resolution image inpainting, 2022. URLhttps://arxiv.org/abs/2206.13644

  39. [39]

    Desaint, Paris, 1788

    Joseph-Louis Lagrange.Mécanique Analytique. Desaint, Paris, 1788

  40. [40]

    Lehome: A simulation environment for deformable object manipulation in household scenarios

    Zeyi Li, Jade Yang, Jingkai Xu, Shangbin Xie, Yuran Wang, Zhenhao Shen, Tianxing Chen, Yan Shen, Wenjun Li, Yukun Zheng, Chaorui Zhang, Ming Chen, Chen Xie, and Ruihai Wu. Lehome: A simulation environment for deformable object manipulation in household scenarios. InIROS 2025 - 5th Workshop on RObotic MAnipulation of Deformable Objects: holistic approaches...

  41. [41]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

  42. [42]

    In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021

    Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021. doi: 10.1109/cvpr46437.2021.00286. URL http://dx.doi. org/10.1109/cvpr46437.2021.00286

  43. [43]

    Gaga: Group any gaussians via 3d-aware memory bank

    Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Gaga: Group any gaussians via 3d-aware memory bank. Mar 2025

  44. [44]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In7th Annual Conference on Robot Learning, 2023

  45. [45]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

  46. [46]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, H. Ma...

  47. [47]

    3d gaussian ray tracing: Fast tracing of particle scenes.ACM Transactions on Graphics and SIGGRAPH Asia, 2024

    Nicolas Moenne-Loccoz, Ashkan Mirzaei, Or Perel, Riccardo de Lutio, Janick Martinez Esturo, Gavriel State, Sanja Fidler, Nicholas Sharp, and Zan Gojcic. 3d gaussian ray tracing: Fast tracing of particle scenes.ACM Transactions on Graphics and SIGGRAPH Asia, 2024

  48. [48]

    Diffrf: Rendering- guided 3d radiance field diffusion.Cornell University - arXiv,Cornell University - arXiv, Dec 2022

    Norman Muller, Yawar Siddiqui, Lorenzo Porzi, SamuelRota Bulò, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering- guided 3d radiance field diffusion.Cornell University - arXiv,Cornell University - arXiv, Dec 2022

  49. [49]

    Battaglia

    Charlie Nash, Yaroslav Ganin, S.M.Ali Eslami, and PeterW. Battaglia. Polygen: An autoregressive generative model of 3d meshes.Cornell University - arXiv,Cornell University - arXiv, Feb 2020

  50. [50]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems (RSS), 2024

  51. [51]

    Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots

    Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. InInternational Conference on Learning Representations (ICLR), 2026

  52. [52]

    Point-e: A system for generating 3d point clouds from complex prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. Dec 2022. 11

  53. [53]

    Isaac Sim.https://github.com/isaac-sim/IsaacSim, 2024

    NVIDIA Corporation. Isaac Sim.https://github.com/isaac-sim/IsaacSim, 2024. Version 5.1.0

  54. [54]

    Deepsdf: Learning continuous signed distance functions for shape representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  55. [55]

    Fast: Efficient action tokenization for vision-language-action models, 2025

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URL https://arxiv.org/abs/ 2501.09747

  56. [56]

    Polycam.https://poly.cam, 2024

    Polycam Inc. Polycam.https://poly.cam, 2024. 3D scanning application

  57. [57]

    Langsplat: 3d language gaussian splatting

    Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. Dec 2023

  58. [58]

    Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting, 2024

    Mohammad Nomaan Qureshi, Sparsh Garg, Francisco Yandun, David Held, George Kantor, and Abhishesh Silwal. Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting, 2024. URL https://arxiv.org/abs/ 2409.10161

  59. [59]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URLhttps://api.semanticscholar.org/CorpusID:160025533

  60. [61]

    URLhttps://arxiv.org/abs/2408.00714

  61. [62]

    Grounded sam: Assembling open-world models for diverse visual tasks, 2024

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024

  62. [63]

    Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger

    Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InConference on Computer Vision and Pattern Recognition (CVPR), 2017

  63. [64]

    Scenemaker: Open-set 3d scene generation with decoupled de-occlusion and pose estimation model.arXiv preprint arXiv:2512.10957, 2025

    Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, and Lei Zhang. Scenemaker: Open-set 3d scene generation with decoupled de-occlusion and pose estimation model.arXiv preprint arXiv:2512.10957, 2025

  64. [65]

    3d neural field generation using triplane diffusion

    J.Ryan Shue, EricRyan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. Nov 2022

  65. [66]

    Weiss, Niru Maheswaranathan, and Surya Ganguli

    Jascha Sohl-Dickstein, EricA. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequi- librium thermodynamics.arXiv: Learning,arXiv: Learning, Mar 2015

  66. [67]

    Resolution-robust large mask inpainting with fourier convolutions

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021

  67. [68]

    V olumediffusion: Flexible text-to-3d generation with efficient volumetric encoder

    Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. V olumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. Apr 2024

  68. [69]

    ALOHA 2 Team, Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, Wayne Gramlich, Torr Hage, Alexander Herzog, Jonathan Hoech, Thinh Nguyen, Ian Storz, Baruch Tabanpour, Leila Takayama, Jonathan Tompson, Ayzaan Wahid, Ted Wahrburg, Sichun Xu, Sergey Yaro...

  69. [70]

    URL https://arxiv.org/abs/2405.02292

  70. [71]

    SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images. 2025. URL h...

  71. [72]

    Hanshi Wang, Zijian Cai, Jin Gao, Yiwei Zhang, Weiming Hu, Ke Wang, and Zhipeng Zhang. Online segment any 3d thing as instance tracking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  72. [73]

    Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, Baining Guo, and Microsoft Research. Rodin: A generative model for sculpting 3d digital avatars using diffusion

  73. [74]

    Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. Embodiedgen: Towards a generative 3d world engine for embodied intelligence, 2025. URL https://arxiv.org/abs/2506.10600

  74. [75]

    Yuxin Wang, Qianyi Wu, Guofeng Zhang, and Dan Xu. Gscream: Learning 3d geometry and feature consistent gaussian splatting for object removal. In ECCV, 2024

  75. [76]

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, June 2024

  76. [77]

    Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei, Nicolas Moenne-Loccoz, and Zan Gojcic. 3dgut: Enabling distorted cameras and secondary rays in gaussian splatting. Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  77. [78]

    Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, and Wei-Chiu Ma. Drawer: Digital reconstruction and articulation with environment realism, 2025. URL https://arxiv.org/abs/2504.15278

  78. [79]

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024

  79. [80]

    Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Embodiedsam: Online segment any 3d thing in real time. arXiv preprint arXiv:2408.11811, 2024

  80. [81]

    Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, and Xihui Liu. Omnipart: Part-aware 3d generation with semantic decoupling and structural cohesion. arXiv preprint arXiv:2507.06165, 2025
