pith. machine review for the scientific record.

arxiv: 2604.25459 · v1 · submitted 2026-04-28 · 💻 cs.RO

Recognition: unknown

GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:07 UTC · model grok-4.3

classification 💻 cs.RO
keywords: robot simulation · 3D Gaussian Splatting · vision-based reinforcement learning · Real2Sim · embodied AI · photorealistic rendering · parallel physics · sim-to-real transfer

The pith

GS-Playground reaches 10,000 frames per second of photorealistic rendering inside a parallel physics simulator for vision-based robot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GS-Playground as a simulator that pairs a custom parallel physics engine with batch 3D Gaussian Splatting rendering to produce high-speed, high-fidelity visual observations for embodied AI tasks. It also supplies an automated Real2Sim pipeline that converts real scenes into simulation assets that remain both visually detailed and physically consistent without manual modeling. This combination targets the main barrier that has kept vision-centric methods from scaling like proprioception-only locomotion research: the heavy cost of photorealistic rendering at large batch sizes. If the integration holds, training loops for navigation, locomotion, and contact-rich manipulation can run at previously inaccessible scales while still supporting policy transfer to real hardware.
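
To make the claimed coupling concrete, here is a minimal sketch of the lockstep physics-plus-rendering tick the paper describes. Every name below (ParallelPhysics, BatchGSRenderer, their methods) is a hypothetical stand-in, not the GS-Playground API; the point is only the structure: one batched physics step, then one batched 3DGS render of the state that step just produced.

    # Hypothetical sketch of a synchronized parallel-physics + batch-3DGS tick.
    # Placeholder classes only; a real engine integrates contacts here, and a
    # real batch 3DGS renderer rasterizes Gaussians for all envs in one GPU pass.
    import numpy as np

    class ParallelPhysics:
        def __init__(self, num_envs: int):
            self.state = np.zeros((num_envs, 13))       # pose + velocity per env

        def step(self, actions: np.ndarray) -> np.ndarray:
            self.state[:, :3] += 0.01 * actions[:, :3]  # stand-in dynamics
            return self.state

    class BatchGSRenderer:
        def __init__(self, width: int = 640, height: int = 480):
            self.width, self.height = width, height

        def render(self, state: np.ndarray) -> np.ndarray:
            # Keyed on the *same* state the physics step just produced:
            # this lockstep hand-off is the synchronization the paper claims.
            n = state.shape[0]
            return np.zeros((n, self.height, self.width, 3), dtype=np.uint8)

    physics, renderer = ParallelPhysics(num_envs=4096), BatchGSRenderer()
    obs = renderer.render(physics.step(np.random.randn(4096, 12)))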

Core claim

GS-Playground is a multi-modal framework whose parallel physics engine integrates directly with a batch 3D Gaussian Splatting rendering pipeline to deliver synchronized, photorealistic images at 10^4 FPS at 640x480 resolution. An automated Real2Sim workflow reconstructs complex environments that are simultaneously photorealistic, physically consistent, and memory-efficient. Experiments across locomotion, navigation, and manipulation tasks show that policies trained in the system close the perceptual and physical gaps that have limited prior simulators.
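
For a sense of scale, a back-of-envelope calculation (ours, not the paper's) of the raw pixel traffic the headline number implies:

    # 10^4 FPS at 640x480 RGB, one byte per channel:
    fps, width, height, channels = 10_000, 640, 480, 3
    bytes_per_second = fps * width * height * channels
    print(f"{bytes_per_second / 1e9:.1f} GB/s of rendered pixels")  # ~9.2 GB/s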

What carries the argument

Batch 3D Gaussian Splatting rendering pipeline integrated with a custom parallel physics engine, which produces synchronized visual and physical states at high throughput while preserving fidelity.

Load-bearing premise

The batch rendering pipeline and physics engine stay synchronized at full speed without measurable loss of visual or physical accuracy, and the automated Real2Sim scenes are consistent enough for policies to transfer successfully to real robots.

What would settle it

Measure end-to-end training throughput and sim-to-real success rate for a contact-rich manipulation policy; if FPS falls below 2000 or transfer success is no better than prior simulators under identical compute, the central performance claim does not hold.
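
A sketch of the throughput half of that test, assuming only a generic batched-simulator handle; `env`, `sample_actions`, and `num_envs` are illustrative placeholders, not a real API:

    # Time a fixed number of synchronized step+render ticks and compare the
    # aggregate frame rate against the 2000 FPS floor named above.
    import time

    def measure_fps(env, num_ticks: int = 1_000) -> float:
        actions = env.sample_actions()
        start = time.perf_counter()
        for _ in range(num_ticks):
            env.step(actions)               # physics + batched render per tick
        elapsed = time.perf_counter() - start
        return num_ticks * env.num_envs / elapsed  # total frames per second

    # fps = measure_fps(env)
    # assert fps >= 2_000, "central performance claim does not hold"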

Figures

Figures reproduced from arXiv: 2604.25459 by Anqi Wang, Bin Xie, Bokui Chen, Chenyu Cao, Dixuan Jiang, Guangyu Wang, Guyue Zhou, Haizhou Ge, Hanqing Cui, Hanyang Shao, Heng Zhang, Hongyun Tian, Jiacheng Wang, Jiayuan Zhang, Jinzhi Zhang, Junzhe Wu, Lei Han, Lei Hao, Lu Shi, Mingrui Yu, Ruqi Huang, Shenyu Chen, Tiancai Wang, Tianle Liu, Wei Sui, Xiwa Deng, Xun Yang, Xuran Yao, Yiyi Yan, Yizhou Jiang, Yuchi Zhang, Yue Li, Yufei Jia, Yusen Qin, Yuxiang Chen, Zhanxiang Cao, Zhenbiao Huang, Zheng Li, Zhixing Chen, Zhuoyuan Yu, Zifan Wang, Ziheng Zhang.

Figure 1
Figure 1. GS-Playground Overview. It integrates photorealistic 3D Gaussian Splatting with high-performance parallel physics, achieving over 10^4 FPS at 640 × 480 resolution on a single GPU. We provide comprehensive sensor suites (Contact, Vision, LiDAR) and support a wide range of robotic embodiments and learning tasks, including locomotion, navigation, and manipulation. view at source ↗
Figure 2
Figure 2. GS-Playground System Architecture. Left: an automated Image-to-Physics pipeline that constructs simulation-ready assets from RGB inputs via object segmentation, background inpainting, and 3DGS/mesh reconstruction. Middle: a physics and rendering simulation core with CPU/GPU physics backends, integrated sensor and LiDAR simulation, and batch-optimized 3DGS rendering with pruning and rigid-link kinematics. R… view at source ↗
Figure 3
Figure 3. Physics stability under complex multi-body interactions. (a) The dense store shelf scenario; (b) Stability error across time steps, computed over all objects in the scene with identical initial placements. The error is defined as √(Δp² + Δθ²), where Δp is mean positional drift (m) and Δθ is mean orientation drift (rad); a worked computation of this error appears after the figure list. view at source ↗
Figure 4
Figure 4. Performance Comparison on Complexity Scaling. Left: As the complexity (N, the number of humanoid robots) in a single environment increases, the FPS advantage of our framework becomes increasingly pronounced. Right: At a complexity of N = 10, our framework maintains its FPS advantage with large batch sizes, where Genesis fails to reach convergence. view at source ↗
Figure 5
Figure 5. Rendering throughput comparison between GS-Playground and Isaac Sim’s ray-tracing renderer across varying resolutions. While Isaac Sim relies on manual asset modeling and encounters Out-Of-Memory (OOM) exceptions at higher resolutions, GS-Playground leverages automated asset generation from real-world captures, achieving high-fidelity sim-ready assets with superior rendering throughput. Evaluations are con… view at source ↗
Figure 7
Figure 7. Wall-clock training efficiency for Unitree Go1 locomotion. (a) Flat terrain; (b) Rough terrain (stairs). “deci” denotes the decimation, which refers to the number of physical sub-steps per control step. Lower decimation typically increases throughput but may compromise physical fidelity. view at source ↗
Figure 6
Figure 6. Visualization of rendering. The simulated renderings are nearly indistinguishable from real photographs, indicating photorealistic fidelity. Our framework supports a broad range of tasks and evaluations, including diverse objects and scene configurations. view at source ↗
Figure 8
Figure 8. Real-world deployment of policies trained in GS-Playground. We demonstrate robust Sim2Real transfer across diverse embodiments and modalities: (a) Quadrupedal Locomotion: Velocity tracking on Unitree Go2; (b) Humanoid Locomotion: 23-DoF balancing and walking on Unitree G1; (c) Visual Manipulation: End-to-end RGB-based grasping; (d) Visual Navigation: Real-time cone following on Unitree Go2 using raw RGB o… view at source ↗
Figure 9
Figure 9. Shaking Test Scene: A Franka Panda robot grasps various objects (a cube, a ball, and a bottle) while being subjected to random shaking motions. This setup is used to evaluate the grasping robustness of different simulation methods under dynamic perturbations. view at source ↗
Figure 10
Figure 10. Visual results of the asset generation pipeline on (top) Bridge-GS and (bottom) InteriorGS datasets. The rows display: (1) Original … view at source ↗
Figure 11
Figure 11. We compare the success rates of different policies in simulation and the real world on various tasks. view at source ↗
Figure 12
Figure 12. Learning curves for the G1 Joystick task. view at source ↗
Figure 13
Figure 13. Learning curves for the Go2 Joystick task. view at source ↗
Figure 14
Figure 14. Perception setups for locomotion tasks. Left: The Unitree Go2 robot utilizing a Height Scan sensor to perceive terrain geometry on rough terrain. Right: The Unitree G1 humanoid equipped with a LiDAR sensor for environmental awareness. view at source ↗
Figure 15
Figure 15. The AIRBOT Play PickCube manipulation environment setup. The robot needs to grasp the green cube and lift it to the target … view at source ↗
Figure 16
Figure 16. Render comparison. From left to right: Real World, Ours, Isaac Lab, ManiSkill3, and Mujoco. view at source ↗
Figure 17
Figure 17. Learning curves for the PickCube manipulation task. view at source ↗
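
The stability error in Figure 3 is simple to compute; below is a worked example under one reading of the caption's definition, with numpy standing in for whatever logging the simulator provides (the function name and array shapes are ours, not the paper's).

    # Stability error from Figure 3: sqrt(Δp² + Δθ²), where Δp is the mean
    # positional drift (m) and Δθ the mean orientation drift (rad) over all
    # objects in the scene. Shapes and names are illustrative.
    import numpy as np

    def stability_error(drift_pos: np.ndarray, drift_rot: np.ndarray) -> float:
        """drift_pos: (objects, 3) positional drift; drift_rot: (objects,) rad."""
        dp = np.linalg.norm(drift_pos, axis=-1).mean()   # mean |Δp| over objects
        dtheta = np.abs(drift_rot).mean()                # mean |Δθ| over objects
        return float(np.sqrt(dp**2 + dtheta**2))

    # e.g. 100 objects with millimetre-scale drift:
    err = stability_error(np.full((100, 3), 1e-3), np.full(100, 2e-3))
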
Original abstract

Embodied AI research is undergoing a shift toward vision-centric perceptual paradigms. While massively parallel simulators have catalyzed breakthroughs in proprioception-based locomotion, their potential remains largely untapped for vision-informed tasks due to the prohibitive computational overhead of large-scale photorealistic rendering. Furthermore, the creation of simulation-ready 3D assets heavily relies on labor-intensive manual modeling, while the significant sim-to-real physical gap hinders the transfer of contact-rich manipulation policies. To address these bottlenecks, we propose GS-Playground, a multi-modal simulation framework designed to accelerate end-to-end perceptual learning. We develop a novel high-performance parallel physics engine, specifically designed to integrate with a batch 3D Gaussian Splatting (3DGS) rendering pipeline to ensure high-fidelity synchronization. Our system achieves a breakthrough throughput of 10^4 FPS at 640x480 resolution, significantly lowering the barrier for large-scale visual RL. Additionally, we introduce an automated Real2Sim workflow that reconstructs photorealistic, physically consistent, and memory-efficient environments, streamlining the generation of complex simulation-ready scenes. Extensive experiments on locomotion, navigation, and manipulation demonstrate that GS-Playground effectively bridges the perceptual and physical gaps across diverse embodied tasks. Project homepage: https://gsplayground.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GS-Playground, a multi-modal simulation framework for vision-informed robot learning. It combines a custom parallel physics engine with a batch 3D Gaussian Splatting (3DGS) rendering pipeline to achieve claimed high-throughput photorealistic simulation, reports a throughput of 10^4 FPS at 640x480 resolution, presents an automated Real2Sim workflow for reconstructing photorealistic and physically consistent environments, and evaluates the system on locomotion, navigation, and manipulation tasks to demonstrate bridging of perceptual and physical gaps.

Significance. If the throughput, synchronization fidelity, and policy-transfer results hold under rigorous validation, the work could meaningfully advance large-scale visual RL by reducing rendering overhead and manual asset creation, enabling more realistic sim-to-real transfer in contact-rich settings. The automated Real2Sim pipeline, if shown to produce memory-efficient and consistent scenes, represents a practical contribution to simulation asset generation.

major comments (2)
  1. [Abstract] The headline claim of 10^4 FPS throughput at 640x480 resolution is presented as arising from the novel batch 3DGS + parallel physics integration, yet the abstract supplies no quantitative results, baselines, error bars, or experimental details on achieved FPS, rendering latency, or physics update rates. This absence prevents evaluation of the central performance claim.
  2. [Method (batch 3DGS rendering pipeline integration with parallel physics engine)] The claim that the integration delivers high-fidelity synchronization without loss of speed or visual/physical accuracy is load-bearing for the throughput result. No quantitative bounds on maximum frame-to-physics lag, desync error metrics, ablation studies isolating synchronization overhead, or comparisons of rendered depth/normal consistency against non-batched baselines in contact-rich regimes are referenced. This leaves open the possibility that modest desynchronization under locomotion or manipulation loads preserves the reported throughput while quietly eroding the fidelity the claim requires.
minor comments (2)
  1. [Abstract] The statement that 'extensive experiments... demonstrate that GS-Playground effectively bridges the perceptual and physical gaps' is not accompanied by any summary statistics, success rates, or comparison to prior simulators.
  2. [Abstract] The project homepage is referenced but no information on code release, reproducibility artifacts, or dataset availability is provided in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of 10^4 FPS throughput at 640x480 resolution is presented as arising from the novel batch 3DGS + parallel physics integration, yet the abstract supplies no quantitative results, baselines, error bars, or experimental details on achieved FPS, rendering latency, or physics update rates. This absence prevents evaluation of the central performance claim.

    Authors: The abstract is a high-level summary, while the full quantitative evaluation—including the 10^4 FPS measurement at 640x480, comparisons against baseline simulators, error bars from multiple runs, rendering latency, and physics update rates—is provided in the Experiments section with supporting tables and figures. To improve immediate evaluability of the central claim, we will revise the abstract to include concise references to these metrics and experimental conditions. revision: partial

  2. Referee: [Method (batch 3DGS rendering pipeline integration with parallel physics engine)] The claim that the integration delivers high-fidelity synchronization without loss of speed or visual/physical accuracy is load-bearing for the throughput result. No quantitative bounds on maximum frame-to-physics lag, desync error metrics, ablation studies isolating synchronization overhead, or comparisons of rendered depth/normal consistency against non-batched baselines in contact-rich regimes are referenced. This leaves open the possibility that modest desynchronization under locomotion or manipulation loads preserves the reported throughput while quietly eroding the fidelity the claim requires.

    Authors: The parallel physics engine and batch 3DGS pipeline are explicitly co-designed for lockstep updates, with the reported 10^4 FPS achieved on contact-rich locomotion, navigation, and manipulation tasks serving as indirect evidence that synchronization overhead remains negligible. However, we acknowledge that explicit quantitative validation would strengthen the claim. In the revised manuscript we will add: (i) measured bounds on frame-to-physics lag, (ii) desync error metrics (position/velocity discrepancies), (iii) ablation studies isolating synchronization overhead, and (iv) depth/normal consistency comparisons against non-batched baselines under manipulation loads. revision: yes
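
As a sketch of what the promised desync metric (ii) could measure, assuming the simulator can log, per tick, both the physics state and the pose snapshot the renderer actually consumed (all names are illustrative, not GS-Playground outputs):

    # Compare, tick by tick, the object poses the renderer consumed against
    # the poses the physics engine reports for the same tick. A true lockstep
    # implementation should show gaps at numerical precision only; anything
    # larger indicates the renderer consumed a stale physics state.
    import numpy as np

    def desync_error(render_poses: np.ndarray, physics_poses: np.ndarray) -> dict:
        """Both arrays: (ticks, objects, 3) positions logged per tick."""
        gap = np.linalg.norm(render_poses - physics_poses, axis=-1)
        return {"mean_m": float(gap.mean()), "max_m": float(gap.max())}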

Circularity Check

0 steps flagged

No circularity; claims rest on implementation and empirical validation

Full rationale

The paper is a systems contribution describing an integrated simulator (batch 3DGS rendering + parallel physics engine) and an automated Real2Sim workflow. No mathematical derivations, equations, fitted parameters presented as predictions, or first-principles results appear in the abstract or method descriptions. Throughput figures and synchronization claims are presented as measured outcomes of the implementation rather than derived quantities that reduce to their own inputs. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is used to justify core claims. The central assertions are externally falsifiable via benchmarks on locomotion, navigation, and manipulation tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about rendering-physics synchronization and reconstruction fidelity rather than new mathematical derivations or invented entities.

axioms (2)
  • domain assumption Batch 3D Gaussian Splatting can be integrated with a parallel physics engine to maintain high-fidelity visual and physical synchronization at scale.
    Invoked directly in the description of the high-performance rendering pipeline and throughput claim.
  • domain assumption An automated Real2Sim workflow can produce photorealistic yet physically consistent and memory-efficient simulation environments suitable for policy transfer.
    Central to the second main contribution and the claim that perceptual and physical gaps are bridged.

pith-pipeline@v0.9.0 · 5688 in / 1406 out tokens · 59918 ms · 2026-05-07T16:07:52.945626+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 41 canonical work pages · 11 internal anchors

  1. [1]

    Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin

    Jad Abou-Chakra, Lingfeng Sun, Krishan Rana, Brandon May, Karl Schmeckpeper, Maria Vittoria Minniti, and Laura Herlant. Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin for real-world robot policy evaluation. arXiv preprint arXiv:2504.03597, 2025

  2. [2]

    Solving Rubik's Cube with a Robot Hand

    Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019

  3. [3]

    Dream to manipulate: Compositional world models empowering robot imitation learning with imagination

    Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination. arXiv preprint arXiv:2412.14957, 2024

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    NavDP: Learning sim-to-real navigation diffusion policy with privileged information guidance

    Wenzhe Cai, Jiaqi Peng, Yuqiang Yang, Yujian Zhang, Meng Wei, Hanqing Wang, Yilun Chen, Tai Wang, and Jiangmiao Pang. NavDP: Learning sim-to-real navigation diffusion policy with privileged information guidance. arXiv preprint arXiv:2501.04610, 2025

  6. [6]

    Learning motion skills with adaptive assistive curriculum force in humanoid robots

    Zhanxiang Cao, Yang Zhang, Buqing Nie, Huangxuan Lin, Haoyang Li, and Yue Gao. Learning motion skills with adaptive assistive curriculum force in humanoid robots. arXiv preprint arXiv:2506.23125, 2025

  7. [7]

    Visual dexterity: In-hand reorientation of novel and complex object shapes

    Tao Chen, Megha Tippur, Siyang Wu, Vikash Kumar, Edward Adelson, and Pulkit Agrawal. Visual dexterity: In-hand reorientation of novel and complex object shapes. Science Robotics, 8(84):eadc9244, 2023

  8. [8]

    SAM 3D: 3Dfy Anything in Images

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. SAM 3D: 3Dfy anything in images. arXiv preprint arXiv:2511.16624, 2025

  9. [9]

    Navila: Legged robot vision-language-action model for navigation

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation. In RSS, 2025

  10. [10]

    Learning quadrupedal locomotion on deformable terrain

    Suyoung Choi, Gwanghyeon Ji, Jeongsoo Park, Hyeongjun Kim, Juhyeok Mun, Jeong Hyun Lee, and Jemin Hwangbo. Learning quadrupedal locomotion on deformable terrain. Science Robotics, 8(74):eade2256, 2023

  11. [11]

    Gaussgym: An open-source real-to-sim framework for learning locomotion from pixels

    Alejandro Escontrela, Justin Kerr, Arthur Allshire, Jonas Frey, Rocky Duan, Carmelo Sferrazza, and Pieter Abbeel. Gaussgym: An open-source real-to-sim framework for learning locomotion from pixels. arXiv preprint arXiv:2510.15352, 2025

  12. [12]

    Mini-splatting: Representing scenes with a constrained number of gaussians

    Guangchi Fang and Bing Wang. Mini-splatting: Representing scenes with a constrained number of gaussians. In European Conference on Computer Vision, pages 165–

  13. [13]

    Re3Sim: Generating High-Fidelity Simulation Data via 3D-Photorealistic Real-to-Sim for Robotic Manipulation

    Xiaoshen Han, Minghuan Liu, Yilun Chen, Junqiu Yu, Xiaoyang Lyu, Yang Tian, Bolun Wang, Weinan Zhang, and Jiangmiao Pang. Re3Sim: Generating high-fidelity simulation data via 3D-photorealistic real-to-sim for robotic manipulation. arXiv preprint arXiv:2502.08645, 2025

  14. [14]

    Speedy-splat: Fast 3d gaussian splatting with sparse pixels and sparse primitives

    Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, and Tom Goldstein. Speedy-splat: Fast 3d gaussian splatting with sparse pixels and sparse primitives. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21537–21546, 2025

  15. [15]

    Pup 3d-gs: Principled uncertainty pruning for 3d gaussian splatting

    Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, and Tom Goldstein. Pup 3d-gs: Principled uncertainty pruning for 3d gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5949–5958, 2025

  16. [16]

    ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills

    Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025

  17. [17]

    Viral: Visual sim-to-real at scale for humanoid loco-manipulation

    Tairan He, Zi Wang, Haoru Xue, Qingwei Ben, Zhengyi Luo, Wenli Xiao, Ye Yuan, Xingye Da, Fernando Castañeda, Shankar Sastry, et al. Viral: Visual sim-to-real at scale for humanoid loco-manipulation. arXiv preprint arXiv:2511.15200, 2025

  18. [18]

    Learning agile and dynamic motor skills for legged robots

    Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019

  19. [19]

    Discoverse: Efficient robot simulation in complex high-fidelity environments

    Yufei Jia, Guangyu Wang, Yuhang Dong, Junzhe Wu, Yupei Zeng, Haonan Lin, Zifan Wang, Haizhou Ge, Weibin Gu, Kairui Ding, et al. Discoverse: Efficient robot simulation in complex high-fidelity environments. arXiv preprint arXiv:2507.21981, 2025

  20. [20]

    GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, and Xiaolong Wang. GSWorld: Closed-loop photorealistic simulation suite for robotic manipulation. arXiv preprint arXiv:2510.20813, 2025

  21. [21]

    Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos

    Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, and Yunzhu Li. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos. arXiv preprint arXiv:2503.17973, 2025

  22. [22]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG), 44(6):1–16, 2025

  23. [23]

    Robust in-hand reorientation with hierarchical RL-based motion primitives and model-based regrasping

    Yongpeng Jiang, Mingrui Yu, Chen Chen, Yongyi Jia, and Xiang Li. Robust in-hand reorientation with hierarchical RL-based motion primitives and model-based regrasping. IEEE Robotics and Automation Practice, 1:12–17, 2025

  24. [24]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139:1–139:14, 2023

  25. [25]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  26. [26]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

  27. [27]

    Rma: Rapid motor adaptation for legged robots

    Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021

  28. [28]

    SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. SimpleVLA-RL: Scaling VLA training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025

  29. [29]

    Robogsim: A real2sim2real robotic gaussian splatting simulator

    Xinhai Li, Jialin Li, Ziheng Zhang, Rui Zhang, Fan Jia, Tiancai Wang, Haoqiang Fan, Kuo-Kun Tseng, and Ruiping Wang. Robogsim: A real2sim2real robotic gaussian splatting simulator. arXiv preprint arXiv:2411.11839, 2024

  30. [30]

    CLONE: Closed-loop whole-body humanoid teleoperation for long-horizon tasks

    Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. CLONE: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In 9th Annual Conference on Robot Learning (CoRL), 2025

  31. [31]

    Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation

    Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, and Yadong Mu. Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation. arXiv preprint arXiv:2501.16764, 2025

  32. [32]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023

  33. [33]

    Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation

    Haozhe Lou, Yurong Liu, Yike Pan, Yiran Geng, Jianteng Chen, Wenlong Ma, Chenglong Li, Lin Wang, Hengzhen Feng, Lu Shi, et al. Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15379–15386. IEEE, 2025

  34. [34]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

  35. [35]

    Walk these ways: Tuning robot control for generalization with multiplicity of behavior

    Gabriel B Margolis and Pulkit Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. In Conference on Robot Learning, pages 22–31. PMLR, 2023

  36. [36]

    Rapid locomotion via reinforcement learning

    Gabriel B Margolis, Ge Yang, Kartik Paigwar, Tao Chen, and Pulkit Agrawal. Rapid locomotion via reinforcement learning. The International Journal of Robotics Research, 43(4):572–587, 2024

  37. [37]

    Sim-to-real reinforcement learning for deformable object manipulation

    Jan Matas, Stephen James, and Andrew J. Davison. Sim-to-real reinforcement learning for deformable object manipulation. In Conference on Robot Learning (CoRL), pages 734–743. PMLR, 2018

  38. [38]

    Sharp monocular view synthesis in less than a second

    Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, et al. Sharp monocular view synthesis in less than a second. arXiv preprint arXiv:2512.10685, 2025

  39. [39]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025

  40. [40]

    Splat-sim: Zero-shot sim2real transfer of RGB manipulation policies using gaussian splatting

    M Nomaan Qureshi, Sparsh Garg, Francisco Yandun, David Held, George Kantor, and Abhisesh Silwal. Splat-sim: Zero-shot sim2real transfer of RGB manipulation policies using gaussian splatting. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 6502–6509. IEEE, 2025

  41. [42]

    SAM 2: Segment anything in images and videos

    URL https://arxiv.org/abs/2408.00714

  42. [43]

    An extensible, data-oriented architecture for high-performance, many-world simulation

    Brennan Shacklett, Luc Guy Rosenzweig, Zhiqiang Xie, Bidipta Sarkar, Andrew Szot, Erik Wijmans, Vladlen Koltun, Dhruv Batra, and Kayvon Fatahalian. An extensible, data-oriented architecture for high-performance, many-world simulation. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023

  43. [44]

    Blind bipedal stair traversal via sim-to-real reinforcement learning

    Jonah Siekmann, Kevin Green, John Warila, Alan Fern, and Jonathan Hurst. Blind bipedal stair traversal via sim-to-real reinforcement learning. arXiv preprint arXiv:2105.08328, 2021

  44. [45]

    Resolution-robust large mask inpainting with fourier convolutions

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021

  45. [46]

    Maniskill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. Maniskill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. Robotics: Science and Systems, 2025

  46. [47]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

  47. [48]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

  48. [49]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  49. [50]

    Arm-constrained curriculum learning for loco-manipulation of a wheel-legged robot

    Zifan Wang, Yufei Jia, Lu Shi, Haoyu Wang, Haizhou Zhao, Xueyang Li, Jinni Zhou, Jun Ma, and Guyue Zhou. Arm-constrained curriculum learning for loco-manipulation of a wheel-legged robot. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10770–10776. IEEE, 2024

  50. [51]

    Omni-perception: Omnidirectional collision avoidance for legged locomotion in dynamic environments

    Zifan Wang, Teli Ma, Yufei Jia, Xun Yang, Jiaming Zhou, Wenlong Ouyang, Qiang Zhang, and Junwei Liang. Omni-perception: Omnidirectional collision avoidance for legged locomotion in dynamic environments. arXiv preprint arXiv:2505.19214, 2025

  51. [52]

    Opening the sim-to-real door for humanoid pixel-to-action policy transfer

    Haoru Xue, Tairan He, Zi Wang, Qingwei Ben, Wenli Xiao, Zhengyi Luo, Xingye Da, Fernando Castañeda, Guanya Shi, Shankar Sastry, et al. Opening the sim-to-real door for humanoid pixel-to-action policy transfer. arXiv preprint arXiv:2512.01061, 2025

  52. [53]

    Novel demonstration generation with gaussian splatting enables robust one-shot manipulation

    Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, and Jiangmiao Pang. Novel demonstration generation with gaussian splatting enables robust one-shot manipulation. arXiv preprint arXiv:2504.13175, 2025

  53. [54]

    DexterityGen: Foundation controller for unprecedented dexterity

    Zhao-Heng Yin, Changhao Wang, Luis Pineda, Francois Hogan, Krishna Bodduluri, Akash Sharma, Patrick Lancaster, Ishita Prasad, Mrinal Kalakrishnan, Jitendra Malik, et al. DexterityGen: Foundation controller for unprecedented dexterity. In Proceedings of Robotics: Science and Systems (RSS), 2025

  54. [55]

    Real2render2real: Scaling robot data without dynamics simulation or robot hardware

    Justin Yu, Letian Fu, Huang Huang, Karim El-Refai, Rares Andrei Ambrus, Richard Cheng, Muhammad Zubair Irshad, and Ken Goldberg. Real2render2real: Scaling robot data without dynamics simulation or robot hardware. arXiv preprint arXiv:2505.09601, 2025

  55. [56]

    Mujoco playground

    Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire, Erik Frey, Koushil Sreenath, Lueder A Kahrs, et al. Mujoco playground. arXiv preprint arXiv:2502.08844, 2025

  56. [57]

    A vision-language-action-critic model for robotic real-world reinforcement learning

    Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025

  57. [58]

    Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024

  58. [59]

    Embodied navigation foundation model

    Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model. arXiv preprint arXiv:2509.12129, 2025

  59. [60]

    Real-to-sim robot policy evaluation with gaussian splatting simulation of soft-body interactions

    Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, and Yunzhu Li. Real-to-sim robot policy evaluation with gaussian splatting simulation of soft-body interactions. arXiv preprint arXiv:2511.04665, 2025

  60. [61]

    Keep on going: Learning robust humanoid motion skills via selective adversarial training

    Yang Zhang, Zhanxiang Cao, Buqing Nie, Haoyang Li, Zhong Jiangwei, Qiao Sun, Xiaoyi Hu, Xiaokang Yang, and Yue Gao. Keep on going: Learning robust humanoid motion skills via selective adversarial training. arXiv preprint arXiv:2507.08303, 2025

  61. [62]

    Genesis: A generative and universal physics engine for robotics and beyond

    Xian Zhou, Yiling Qiao, Zhenjia Xu, TH Wang, Z Chen, J Zheng, Z Xiong, Y Wang, M Zhang, P Ma, et al. Genesis: A generative and universal physics engine for robotics and beyond. arXiv preprint arXiv:2401.01454, 2024

  62. [63]

    Vision-language-action model with open-world embodied reasoning from pretrained knowledge

    Zhongyi Zhou, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Vision-language-action model with open-world embodied reasoning from pretrained knowledge. arXiv preprint arXiv:2505.21906, 2025

  63. [64]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

  64. [65]

    Robot parkour learning

    Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, and Hang Zhao. Robot parkour learning. arXiv preprint arXiv:2309.05665, 2023