GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

Chao Liang; Jiawei Zhang; Nuo Xu; Qichen Zhang; Qin Jin; Seson Sun; Yantai Yang; Yiming Yan; Yingqiao Wang; Yuhao Xu

arxiv: 2606.17520 · v1 · pith:NGZQ6GTQnew · submitted 2026-06-16 · 💻 cs.RO · cs.CV


Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun, Qichen Zhang, Yuhao Xu, Yantai Yang, Yingqiao Wang, Qin Jin, Zhipeng Zhang


Pith reviewed 2026-06-27 00:54 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: gaussian splatting, scene reconstruction, sim-to-real, embodied agents, robot learning, object extraction, inpainting, simulation environments


The pith

GASE automates reconstruction of high-fidelity simulation environments from panoramic videos for robot learning with under 10 percent sim-to-real gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GASE as an automated system that scans environments using multi-view video from panoramic camera arrays and applies Gaussian splatting for reconstruction. It introduces a camera-pose-based strategy to extract objects in the 2D domain and performs high-fidelity inpainting to separate foreground and background. This pipeline allows independent reconstruction of assets that can be imported into physics simulators. The approach aims to enable large-scale training of embodied agents with reduced sim-to-real gap and lower costs than real-world data collection.

Core claim

GASE uses multi-view video streams from panoramic camera arrays for rapid environment scanning. A camera-pose-based strategy extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are reconstructed independently with Gaussian splatting and seamlessly imported into physics simulators for policy training. Experiments show it outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% and achieves state-of-the-art inpainting quality. Real-robot deployments in manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained on real-world data.

What carries the argument

The camera-pose-based strategy for robust object extraction across frames in the 2D domain followed by high-fidelity scene inpainting to enable independent reconstruction of foreground objects and static background.
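The summary does not describe how camera poses are used during extraction. As a generic illustration only, and not the paper's method, the sketch below shows standard pinhole reprojection: given a known world-to-camera pose, a 3D point can be mapped into any frame, which is one common way to carry an object location from one view to another. The function name `project_point`, the intrinsics, and the poses are all hypothetical, and a panoramic camera would need a different projection model.

```python
# Hypothetical illustration: standard pinhole reprojection, NOT the GASE algorithm.

def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point into pixel coordinates.

    rotation (3x3 nested lists) and translation (length 3) map world
    coordinates to camera coordinates: p_cam = R @ p_world + t.
    Returns (u, v), or None if the point lies behind the camera.
    """
    p_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Frame A: camera at the world origin.
# Frame B: camera shifted 0.5 units along +x, so the world-to-camera
# translation is (-0.5, 0, 0).
point = [0.0, 0.0, 2.0]
uv_a = project_point(point, IDENTITY, [0.0, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0)
uv_b = project_point(point, IDENTITY, [-0.5, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0)
```

The same point lands at different pixels in the two frames, and the offset is fully determined by the pose. That is the property any pose-based cross-frame association would rely on.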

If this is right

Outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10%.
Achieves state-of-the-art inpainting quality.
Maintains a performance gap of less than 10% in real-robot manipulation and navigation tasks compared to real-world trained policies.
Enables efficient import of reconstructed assets into physics simulators for embodied agent training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This system could support scaling up training data for robot policies without corresponding increases in real-world collection efforts.
The reconstruction technique might be adapted for other simulation domains beyond robotics if the extraction method generalizes.
Further validation on longer sequences or more cluttered scenes would test the robustness of the 2D extraction step.

Load-bearing premise

The camera-pose-based strategy robustly extracts objects across frames in the 2D domain to enable high-fidelity independent reconstruction of foreground objects and static background.

What would settle it

Observing a performance gap exceeding 10% between GASE-trained policies and real-world trained policies on the reported manipulation and navigation tasks.
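The abstract does not say whether "less than 10%" means an absolute gap in percentage points or a gap relative to the real-data policy, and the two readings can disagree near the threshold. A minimal sketch, using made-up success rates and a hypothetical helper `sim_to_real_gap`, computes both:

```python
def sim_to_real_gap(real_success, sim_success):
    """Return (absolute gap in percentage points, relative gap in percent).

    Both inputs are success rates in [0, 1]; real_success must be positive
    because the relative gap divides by it.
    """
    if not 0.0 < real_success <= 1.0:
        raise ValueError("real_success must be in (0, 1]")
    difference = real_success - sim_success
    absolute_pp = difference * 100.0
    relative_pct = difference / real_success * 100.0
    return absolute_pp, relative_pct


# Made-up numbers for illustration; they are not results from the paper.
abs_gap, rel_gap = sim_to_real_gap(real_success=0.80, sim_success=0.72)
```

With these illustrative inputs the absolute gap is about 8 points while the relative gap is about 10%, so whether the test above is passed would depend on which definition the paper adopts.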


Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GASE gives a concrete pipeline for turning panoramic video into simulator scenes via 2D pose-based extraction and separate Gaussian reconstruction, but the headline numbers have no visible support yet.


The main point to take away is that this paper describes a complete system called GASE for quickly turning panoramic video into simulator-ready 3D scenes using Gaussian splatting, with separate handling of objects and background, and it claims the resulting policies transfer well to real robots. The evidence for those claims is not visible in the abstract.

What is actually new here is the specific workflow: using camera poses to extract objects in 2D across multiple frames, applying high-fidelity inpainting to the scenes, then reconstructing foreground and background independently with Gaussians before importing to a physics simulator. This end-to-end automation from capture to sim import extends prior Gaussian work in a way that targets the embodied AI use case directly.

The paper does well at identifying a real bottleneck in robot learning, which is the time and cost to build custom high-quality simulation environments. The idea of using panoramic arrays for fast scanning and then automating the asset creation makes sense as a practical contribution. If the real-robot results hold up, that would be the most useful part for the field.

The soft spots are in the experimental support. The abstract states quantitative wins like over 10% better segmentation accuracy than other 3D Gaussian methods and a performance gap under 10% versus real-world trained policies, but it provides zero information on the experimental setup, what the baselines were, how many scenes or tasks were tested, or any statistical measures. Without that, it's impossible to know if the numbers are reliable or if choices in the pipeline were tuned post-hoc.

The stress-test concern about the camera-pose 2D extraction being the unverified link is fair based on what's here. The paper says this strategy "robustly extracts objects," but there's no mention of failure cases, occlusion handling, or an ablation that removes the inpainting step to show its impact. If that step doesn't work consistently, the independent reconstructions and the downstream sim-to-real performance wouldn't follow.

This paper is aimed at people in robotics and embodied AI who build or use simulation for policy training. A reader looking for applied systems that combine reconstruction techniques with simulator integration would find the pipeline description relevant, even if the numbers need verification.

I think it deserves a serious referee to check the full methods and results, because the core idea is a useful one if the implementation details check out. The claims are empirical rather than theoretical, so the review should focus on reproducibility and whether the extraction really delivers the reported gains.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes GASE, an automated pipeline for constructing high-fidelity simulation environments from multi-view panoramic video using 3D Gaussian splatting. A camera-pose-based 2D extraction step followed by scene inpainting separates foreground objects from static background, enabling independent reconstruction and direct import into physics simulators. The central empirical claims are >10% gains in segmentation accuracy over prior 3D Gaussian methods, state-of-the-art inpainting quality, and real-robot manipulation/navigation policies whose performance remains within 10% of policies trained exclusively on real data.

Significance. If the reported performance numbers are supported by properly documented experiments, the work would be significant for embodied AI: it directly targets the data-acquisition bottleneck in sim-to-real transfer by offering a largely automated, high-visual-fidelity reconstruction workflow. The explicit promise to release code is a positive factor for reproducibility.

major comments (2)

[Abstract / §4] Abstract and §4 (Experiments): the headline claims of '>10% segmentation accuracy' and '<10% sim-to-real performance gap' are presented without any description of dataset sizes, number of scenes, exact metrics, baseline implementations, number of trials, or statistical tests. Because these numbers constitute the primary evidence for the central claim that the pipeline bridges the sim-to-real gap, the absence of protocol details is load-bearing.
[§3.2] §3.2 (Object Extraction and Inpainting): the entire performance narrative rests on the assertion that the camera-pose-based 2D extraction 'robustly extracts objects across frames' followed by high-fidelity inpainting. No quantitative ablation (e.g., extraction IoU, failure rates under occlusion or blur) or removal of the inpainting stage is reported, leaving the causal link between the proposed strategy and the claimed downstream gains unverified.
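The extraction IoU the referee asks for is a standard mask-overlap metric. A minimal sketch of how it could be computed per frame is shown here; the helper name `mask_iou` and the toy masks are illustrative assumptions, and the paper's own metric implementation is not visible in the abstract.

```python
def mask_iou(pred, truth):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen here: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


# Toy 3x3 masks for illustration only.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 1, 0]]
iou = mask_iou(pred, truth)  # 3 shared pixels out of 5 in the union
```

Averaging this value over frames, and counting frames that fall below a chosen threshold as failures, would give the kind of per-frame IoU and failure-rate numbers the comment requests.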

minor comments (1)

[Abstract] The abstract states 'Code will be released' but provides neither a repository URL nor a commit hash; this should be added for a camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. The feedback correctly identifies areas where additional experimental documentation and ablations would strengthen the presentation of our results. We address each major comment below and will incorporate the suggested revisions.


Referee: [Abstract / §4] Abstract and §4 (Experiments): the headline claims of '>10% segmentation accuracy' and '<10% sim-to-real performance gap' are presented without any description of dataset sizes, number of scenes, exact metrics, baseline implementations, number of trials, or statistical tests. Because these numbers constitute the primary evidence for the central claim that the pipeline bridges the sim-to-real gap, the absence of protocol details is load-bearing.

Authors: We agree that the current presentation of headline claims in the abstract and §4 would benefit from more explicit protocol documentation. Although §4 describes the datasets, metrics, and tasks at a high level, it does not enumerate scene counts, trial numbers, or include statistical tests. In the revised manuscript we will expand §4 with a new 'Experimental Protocol' subsection that reports: number of panoramic video sequences and distinct scenes, exact metric definitions and implementations, baseline code references or re-implementations, number of policy training/evaluation trials per task, and results of appropriate statistical tests (e.g., paired t-tests with p-values).

Revision: yes

Referee: [§3.2] §3.2 (Object Extraction and Inpainting): the entire performance narrative rests on the assertion that the camera-pose-based 2D extraction 'robustly extracts objects across frames' followed by high-fidelity inpainting. No quantitative ablation (e.g., extraction IoU, failure rates under occlusion or blur) or removal of the inpainting stage is reported, leaving the causal link between the proposed strategy and the claimed downstream gains unverified.

Authors: The referee is correct that §3.2 currently provides only a qualitative description of the extraction and inpainting pipeline without supporting quantitative ablations. To establish the contribution of these components, the revised manuscript will add quantitative results: per-frame extraction IoU and failure rates under controlled occlusion/blur conditions, plus an ablation that removes the inpainting stage and measures the resulting impact on both segmentation accuracy and downstream policy performance. These new experiments will be reported in an expanded §4.

Revision: yes
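The paired t-test mentioned in the first response compares two policies on the same set of tasks. As a rough sketch of the underlying statistic, assuming per-task success counts for a real-data policy and a simulation-trained policy (the numbers below are invented, and `paired_t_statistic` is a hypothetical helper), the computation could look like this. Turning the statistic into a p-value needs a Student t distribution, which the Python standard library does not provide, so that step is left out.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Invented successes out of 20 trials on four tasks; not data from the paper.
real_counts = [16, 15, 18, 17]
sim_counts = [14, 14, 16, 16]
t_stat, dof = paired_t_statistic(real_counts, sim_counts)
```

Pairing by task removes between-task difficulty from the comparison, which is why the same tasks must be evaluated under both training regimes for the test to apply.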

Circularity Check

0 steps flagged

No circularity: empirical system paper with no derivations or fitted predictions


The manuscript presents an automated reconstruction pipeline (multi-view video, camera-pose 2D extraction, inpainting, independent foreground/background Gaussian reconstruction) and supports its claims solely via reported experimental metrics (segmentation accuracy, inpainting quality, real-robot policy transfer gaps). No equations, parameter-fitting steps, self-citations used as uniqueness theorems, or renamings of known results appear in the provided text. The central claims rest on external empirical benchmarks rather than any internal reduction to fitted inputs or self-referential definitions, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the proposed extraction and inpainting steps and on the assumption that Gaussian splatting produces assets suitable for physics simulation; no free parameters, ad-hoc axioms, or new invented entities are introduced in the abstract.

axioms (1)

Domain assumption: 3D Gaussian splatting can produce high-fidelity reconstructions from multi-view images suitable for downstream simulation.
Invoked implicitly when stating that foreground and background are reconstructed independently and imported into physics simulators.

pith-pipeline@v0.9.1-grok · 5801 in / 1346 out tokens · 40575 ms · 2026-06-27T00:54:17.557126+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin, 2025

Jad Abou-Chakra, Lingfeng Sun, Krishan Rana, Brandon May, Karl Schmeckpeper, Niko Suenderhauf, Maria Vittoria Minniti, and Laura Herlant. Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin, 2025. URL https: //arxiv.org/abs/2504.03597

arXiv 2025
[2]

Piper robotic arm.https://www.agibot.com/, 2025

AgiBot. Piper robotic arm.https://www.agibot.com/, 2025. Accessed: 2026-05-09

2025
[3]

Barron, Ben Mildenhall, Dor Verbin, Pratul P

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields.CVPR, 2022

2022
[4]

URLhttps://arxiv.org/abs/2410.24164

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...

Pith/arXiv arXiv 2026
[5]

Rt-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, NikhilJ Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, D...

2022
[6]

Physx-3d: Physical-grounded 3d asset generation.arXiv preprint arXiv:2507.12465, 2025

Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-3d: Physical-grounded 3d asset generation.arXiv preprint arXiv:2507.12465, 2025

arXiv 2025
[7]

Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

arXiv 2025
[8]

Sam 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Va- sudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Lilian...

Pith/arXiv arXiv 2025
[9]

Segment any 3d gaussians

Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. Dec 2023

2023
[10]

A survey on 3d gaussian splatting, 2025

Guikun Chen and Wenguan Wang. A survey on 3d gaussian splatting, 2025. URL https://arxiv.org/abs/2401. 03890

2025
[11]

Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. Apr 2023

2023
[12]

Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis, 2025

Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis, 2025. URL https://arxiv.org/abs/2503. 13265

2025
[13]

Meshanything: Artist-created mesh generation with autoregressive transformers, 2024

Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, Guosheng Lin, and Chi Zhang. Meshanything: Artist-created mesh generation with autoregressive transformers, 2024

2024
[14]

Tracking anything with decoupled video segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. InICCV, 2023

2023
[15]

Embodiedsplat: Personalized real-to-sim-to-real navigation with gaussian splats from a mobile device, 2025

Gunjan Chhablani, Xiaomeng Ye, Muhammad Zubair Irshad, and Zsolt Kira. Embodiedsplat: Personalized real-to-sim-to-real navigation with gaussian splats from a mobile device, 2025. URLhttps://arxiv.org/abs/2509.17430

arXiv 2025
[16]

X-sim: Cross-embodiment learning via real-to-sim-to-real

Prithwish Dan, Kushal Kedia, Angela Chao, Edward Weiyi Duan, Maximus Adrian Pace, Wei-Chiu Ma, and Sanjiban Choudhury. X-sim: Cross-embodiment learning via real-to-sim-to-real. 2025. URLhttps://arxiv.org/abs/2505.07096

arXiv 2025
[17]

Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation

Hongwei Fan, Hang Dai, Jiyao Zhang, Jinzhou Li, Qiyang Yan, Yujie Zhao, Mingju Gao, Jinghang Wu, Hao Tang, and Hao Dong. Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation. 2025. URL https://arxiv.org/abs/2512.19390. 9

arXiv 2025
[18]

Zhao, and Chelsea Finn

Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), 2024

2024
[19]

Theory of the motion of the heavenly bodies moving about the sun in conic sections.Gauss’s Theoria Motus, 76(1):5–23, 1857

Carl Friedrich Gauss and Charles Henry Davis. Theory of the motion of the heavenly bodies moving about the sun in conic sections.Gauss’s Theoria Motus, 76(1):5–23, 1857
[20]

Re3sim: Generating high-fidelity simulation data via 3d-photorealistic real-to-sim for robotic manipulation

Xiaoshen Han, Junqiu Yu, Minghuan Liu, Yilun Chen, Xiaoyang Lyu, Yang Tian, Bolun Wang, Weinan Zhang, Weinan Zhang, and Jiangmiao Pang. Re3sim: Generating high-fidelity simulation data via 3d-photorealistic real-to-sim for robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), 2026

2026
[21]

Gvgen: Text-to-3d generation with volumetric representation

Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation
[22]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, Pieter Abbeel, and UC Berkeley. Denoising diffusion probabilistic models
[23]

In: Burbano, A., Zorin, D., Jarosz, W

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InSIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024. doi: 10.1145/3641519. 3657428

work page doi:10.1145/3641519 2024
[24]

3d gaussian inpainting with depth-guided cross-view consistency

Sheng-Yu Huang, Zi-Ting Chou, and Yu-Chiang Frank Wang. 3d gaussian inpainting with depth-guided cross-view consistency. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26704–26713, 2025

2025
[25]

Neural wavelet-domain diffusion for 3d shape generation

Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3d shape generation. Sep 2022

2022
[26]

Insta360 x5.https://www.insta360.com/, 2025

Insta360. Insta360 x5.https://www.insta360.com/, 2025. Accessed: 2026-05-09

2025
[27]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szym...

Pith/arXiv arXiv 2025
[28]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

Pith/arXiv arXiv 2025
[29]

Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025

Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, and Karl Pertsch. Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025. URLhttps://arxiv.org/abs/2512.16881

arXiv 2025
[30]

Postshot.https://www.jawset.com/, 2025

Jawset Visual Computing. Postshot.https://www.jawset.com/, 2025. Accessed: 2026-05-09

2025
[31]

Fastlgs: Speeding up language embedded gaussians with feature grid mapping

Yuzhou Ji, He Zhu, Junshu Tang, Wuyi Liu, Zhizhong Zhang, Xin Tan, and Yuan Xie. Fastlgs: Speeding up language embedded gaussians with feature grid mapping. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

2025
[32]

Planing: A loosely coupled triangle-gaussian framework for streaming 3d reconstruction, 2026

Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, and Mulin Yu. Planing: A loosely coupled triangle-gaussian framework for streaming 3d reconstruction, 2026. URL https://arxiv. org/abs/2601.22046

arXiv 2026
[33]

Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025

2025
[34]

Poisson surface reconstruction

M Kazhdan. Poisson surface reconstruction. 2006

2006
[35]

Screened poisson surface reconstruction.Acm Transactions on Graphics, 32(3):1–13, 2013

Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction.Acm Transactions on Graphics, 32(3):1–13, 2013. 10

2013
[36]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/ 3d-gaussian-splatting/

2023
[37]

Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything.arXiv:2304.02643, 2023

Pith/arXiv arXiv 2023
[38]

Feature refinement to improve high resolution image inpainting, 2022

Prakhar Kulshreshtha, Brian Pugh, and Salma Jiddi. Feature refinement to improve high resolution image inpainting, 2022. URLhttps://arxiv.org/abs/2206.13644

arXiv 2022
[39]

Desaint, Paris, 1788

Joseph-Louis Lagrange.Mécanique Analytique. Desaint, Paris, 1788
[40]

Lehome: A simulation environment for deformable object manipulation in household scenarios

Zeyi Li, Jade Yang, Jingkai Xu, Shangbin Xie, Yuran Wang, Zhenhao Shen, Tianxing Chen, Yan Shen, Wenjun Li, Yukun Zheng, Chaorui Zhang, Ming Chen, Chen Xie, and Ruihai Wu. Lehome: A simulation environment for deformable object manipulation in household scenarios. InIROS 2025 - 5th Workshop on RObotic MAnipulation of Deformable Objects: holistic approaches...

2025
[41]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

Pith/arXiv arXiv 2023
[42]

In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021. doi: 10.1109/cvpr46437.2021.00286. URL http://dx.doi. org/10.1109/cvpr46437.2021.00286

work page doi:10.1109/cvpr46437.2021.00286 2021
[43]

Gaga: Group any gaussians via 3d-aware memory bank

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Gaga: Group any gaussians via 3d-aware memory bank. Mar 2025

2025
[44]

Mimicgen: A data generation system for scalable robot learning using human demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In7th Annual Conference on Robot Learning, 2023

2023
[45]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

2020
[46]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, H. Ma...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.04831 2025
[47]

3d gaussian ray tracing: Fast tracing of particle scenes.ACM Transactions on Graphics and SIGGRAPH Asia, 2024

Nicolas Moenne-Loccoz, Ashkan Mirzaei, Or Perel, Riccardo de Lutio, Janick Martinez Esturo, Gavriel State, Sanja Fidler, Nicholas Sharp, and Zan Gojcic. 3d gaussian ray tracing: Fast tracing of particle scenes.ACM Transactions on Graphics and SIGGRAPH Asia, 2024

2024
[48]

Diffrf: Rendering- guided 3d radiance field diffusion.Cornell University - arXiv,Cornell University - arXiv, Dec 2022

Norman Muller, Yawar Siddiqui, Lorenzo Porzi, SamuelRota Bulò, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering- guided 3d radiance field diffusion.Cornell University - arXiv,Cornell University - arXiv, Dec 2022

2022
[49]

Battaglia

Charlie Nash, Yaroslav Ganin, S.M.Ali Eslami, and PeterW. Battaglia. Polygen: An autoregressive generative model of 3d meshes.Cornell University - arXiv,Cornell University - arXiv, Feb 2020

2020
[50]

Robocasa: Large-scale simulation of everyday tasks for generalist robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems (RSS), 2024

2024
[51]

Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots

Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. InInternational Conference on Learning Representations (ICLR), 2026

2026
[52]

Point-e: A system for generating 3d point clouds from complex prompts

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. Dec 2022. 11

2022
[53]

Isaac Sim.https://github.com/isaac-sim/IsaacSim, 2024

NVIDIA Corporation. Isaac Sim.https://github.com/isaac-sim/IsaacSim, 2024. Version 5.1.0

2024
[54]

Deepsdf: Learning continuous signed distance functions for shape representation

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

2019
[55]

Fast: Efficient action tokenization for vision-language-action models, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URL https://arxiv.org/abs/ 2501.09747

Pith/arXiv arXiv 2025
[56]

Polycam.https://poly.cam, 2024

Polycam Inc. Polycam.https://poly.cam, 2024. 3D scanning application

2024
[57]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. Dec 2023

2023
[58] Mohammad Nomaan Qureshi, Sparsh Garg, Francisco Yandun, David Held, George Kantor, and Abhishesh Silwal. Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting, 2024. URL https://arxiv.org/abs/2409.10161
[59] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533
[61] URL https://arxiv.org/abs/2408.00714
[62] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024
[63] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017
[64] Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, and Lei Zhang. Scenemaker: Open-set 3d scene generation with decoupled de-occlusion and pose estimation model. arXiv preprint arXiv:2512.10957, 2025
[65] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. Nov 2022
[66] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv: Learning, Mar 2015
[67] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021
[68] Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. Apr 2024
[69]

ALOHA 2 Team, Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, Wayne Gramlich, Torr Hage, Alexander Herzog, Jonathan Hoech, Thinh Nguyen, Ian Storz, Baruch Tabanpour, Leila Takayama, Jonathan Tompson, Ayzaan Wahid, Ted Wahrburg, Sichun Xu, Sergey Yaro...
[70] URL https://arxiv.org/abs/2405.02292
[71] SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images. 2025. URL h...
[72] Hanshi Wang, Zijian Cai, Jin Gao, Yiwei Zhang, Weiming Hu, Ke Wang, and Zhipeng Zhang. Online segment any 3d thing as instance tracking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems
[73] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, Baining Guo, and Microsoft Research. Rodin: A generative model for sculpting 3d digital avatars using diffusion
[74] Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. Embodiedgen: Towards a generative 3d world engine for embodied intelligence, 2025. URL https://arxiv.org/abs/2506.10600
[75] Yuxin Wang, Qianyi Wu, Guofeng Zhang, and Dan Xu. Gscream: Learning 3d geometry and feature consistent gaussian splatting for object removal. In ECCV, 2024
[76] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, June 2024
[77] Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei, Nicolas Moenne-Loccoz, and Zan Gojcic. 3dgut: Enabling distorted cameras and secondary rays in gaussian splatting. Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[78] Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, and Wei-Chiu Ma. Drawer: Digital reconstruction and articulation with environment realism, 2025. URL https://arxiv.org/abs/2504.15278
[79] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024
[80] Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Embodiedsam: Online segment any 3d thing in real time. arXiv preprint arXiv:2408.11811, 2024
[81] Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, and Xihui Liu. Omnipart: Part-aware 3d generation with semantic decoupling and structural cohesion. arXiv preprint arXiv:2507.06165, 2025
