UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms

Anqi Wang; Bokui Chen; Chenxin Dong; Chenyu Cao; Dixuan Jiang; Dongjie Zhu; Fan Jia; Guyue Zhou; Haizhou Ge; Hanqing Cui

A CPU-simulation and GPU-learning split for robot RL achieves 3-10 times higher end-to-end training speed than GPU-only designs.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-29 07:06 UTC pith:MHKJMKHB

Load-bearing objection: UniLab splits simulation to the CPU and learning to the GPU with a unified runtime and claims 3-10x gains, but the abstract supplies none of the throughput or overhead numbers needed to evaluate that claim.

arxiv 2605.30313 v3 pith:MHKJMKHB submitted 2026-05-28 cs.RO

Yufei Jia, Zhanxiang Cao, Mingrui Yu, Heng Zhang, Shenyu Chen, Dixuan Jiang, Meng Li, Xiaofan Li, Yiyang Liu, Junzhe Wu, Zheng Li, XiLin Fang, Ting-Yu Tsui, Shengcheng Fu, Haoyang Li, Anqi Wang, Zifan Wang, Dongjie Zhu, Chenyu Cao, Zhenbiao Huang, Ziang Zheng, Jie Lu, Xin Ma, Zhengyang Wei, Xiang Zhao, Tianyue Zhan, Ye He, Yuxiang Chen, Yizhou Jiang, Yue Li, Haizhou Ge, Yuhang Dong, Fan Jia, Ziheng Zhang, Meng Zhang, Xiwa Deng, Zhixing Chen, Hanyang Shao, Chenxin Dong, Yixuan Li, Yizhi Chen, Bokui Chen, Kaifeng Zhang, Hanqing Cui, Yusen Qin, Ruqi Huang, Lei Han, Tiancai Wang, Xiang Li, Yue Gao, Guyue Zhou

classification cs.RO

keywords heterogeneous architecture · robot reinforcement learning · CPU simulation · GPU learning · training efficiency · cross-platform execution · MuJoCo · PPO

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the default that efficient robot reinforcement learning needs physics simulation on the GPU. It builds a system that runs batched simulations on the CPU while performing policy updates on the GPU, connected by a single runtime layer that handles data transfer and synchronization. This separation produces faster overall training loops on standard robot control benchmarks and reduces dependence on NVIDIA-specific software. The architecture also runs on Apple macOS and on AMD and Intel accelerator platforms.

Core claim

UniLab is a heterogeneous architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. It is implemented using MuJoCoUni and MotrixSim CPU-batched physics backends and supports PPO, FastSAC, FlashSAC, and APPO. On representative simulation-based robot control tasks the architecture improves end-to-end training efficiency by 3-10 times under identical hardware while reducing dependence on the NVIDIA CUDA stack and enabling execution on Apple macOS, AMD ROCm, and Intel XPU backends.

What carries the argument

Unified runtime for data movement, buffering, and synchronization between CPU simulation and GPU learning.
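A minimal sketch of what such a runtime has to coordinate, assuming a single collector and a single learner joined by a bounded buffer. This is not UniLab's implementation: `run_pipeline`, `buffer_slots`, and the integer "rollouts" are hypothetical stand-ins, and the only point illustrated is that a bounded queue gives both buffering and back-pressure between the two sides.

```python
import queue
import threading


def run_pipeline(num_rollouts: int, buffer_slots: int = 2) -> list[int]:
    """Run a collector thread and a learner loop joined by a bounded buffer.

    Hypothetical illustration only: rollouts are plain integers, not
    simulation data, and nothing here touches a GPU.
    """
    buffer: queue.Queue = queue.Queue(maxsize=buffer_slots)
    consumed: list[int] = []

    def collector() -> None:
        # Stand-in for CPU-batched simulation producing rollout batches.
        for rollout_id in range(num_rollouts):
            buffer.put(rollout_id)  # blocks while the buffer is full
        buffer.put(None)            # sentinel: no more rollouts

    worker = threading.Thread(target=collector)
    worker.start()

    # Stand-in for the learner: consume rollouts in arrival order.
    while True:
        rollout = buffer.get()      # blocks while the buffer is empty
        if rollout is None:
            break
        consumed.append(rollout)

    worker.join()
    return consumed


if __name__ == "__main__":
    print(run_pipeline(num_rollouts=5))
```

The `buffer_slots` value controls how far the collector may run ahead of the learner; a value of 1 approximates a tightly synchronized cycle, while larger values allow looser coupling.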

Load-bearing premise

CPU-batched physics backends supply enough simulation speed and fidelity that the end-to-end gains from the split outweigh any synchronization or accuracy costs.

What would settle it

A side-by-side run of the same tasks with a GPU physics engine that shows training time no longer than the UniLab CPU version or that produces policies with lower task performance.

If this is right

End-to-end training efficiency rises by 3-10 times on the same hardware configuration.
Training no longer requires the NVIDIA CUDA software stack.
The same training code runs on Apple macOS, AMD ROCm, and Intel XPU accelerator backends.
GPU-resident simulation is effective for robot RL but is not required for high training speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robot RL training could move to consumer laptops or edge devices that lack high-end GPUs.
Similar CPU-GPU splits might apply to other simulation-heavy domains such as game environments or physics-based animation.
Real-robot deployment pipelines could keep simulation on local CPU resources while using cloud GPUs only for policy updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

UniLab splits simulation to the CPU and learning to the GPU with a unified runtime and claims 3-10x gains, but the abstract supplies none of the throughput or overhead numbers needed to evaluate that claim.

The paper's core move is to treat simulation throughput and policy learning as separate concerns that can be split across CPU and GPU, connected by a runtime that handles data movement and synchronization. It implements this with CPU-batched backends (MuJoCoUni, MotrixSim) feeding PPO, FastSAC and similar algorithms, and it adds support for macOS, ROCm, and Intel XPU. That architecture and the cross-platform reach are the concrete additions.

The implementation description is straightforward and the motivation is clear: GPU-resident simulation became the default because it was fast, but the paper asks whether it is required. The platform flexibility is a practical plus for groups that do not want to stay locked to NVIDIA tooling.

The problem is the performance claim. The abstract states 3-10x end-to-end improvement on representative tasks under the same hardware, yet it gives no simulation steps per second, no transfer latency figures, no baseline descriptions, and no ablation on synchronization cost. The stress-test note is accurate on this point; without those measurements it is impossible to know whether the gains come from the decoupling or from unstated differences in task scale or implementation. If the full paper contains the missing numbers and controls, the claim can be checked. If not, the central result stays unverified.

This is for people who build or maintain robot RL training systems and want documented options beyond the current GPU-sim default. A reader focused on system architecture would find the runtime design worth examining. The thinking is coherent on its own terms, so the paper is worth sending to referees even if the experiments need tightening.

Referee Report

2 major / 0 minor

Summary. The paper presents UniLab, a heterogeneous CPU-simulation / GPU-learning architecture for robot RL that decouples CPU-parallel simulation (via MuJoCoUni and MotrixSim backends) from GPU policy updates through a unified runtime for data movement and synchronization. It supports PPO, FastSAC, FlashSAC, and APPO, and claims 3-10× end-to-end training efficiency gains on representative simulation-based robot control tasks under the same hardware, while reducing NVIDIA CUDA dependence and enabling cross-platform execution on macOS, AMD ROCm, and Intel XPU.

Significance. If the empirical claims are substantiated with detailed measurements, the work would show that GPU-resident physics is not required for efficient robot RL training, thereby expanding practical system design choices beyond the current GPU-dominant paradigm and supporting more diverse hardware backends.

major comments (2)

[Abstract] The performance claims of 3-10× gains are stated without any experimental details, baselines, error bars, task specifications, or measurement methodology, making it impossible to assess whether the data supports the claim.
[Implementation description] The CPU-batched physics backends are asserted to deliver the claimed throughput and fidelity, but no numbers are supplied for batched simulation steps/sec on the evaluated tasks, measured CPU-to-GPU transfer latency per rollout batch, or any ablation isolating synchronization overhead; these quantities are load-bearing for the end-to-end 3-10× claim.
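To see why the second comment calls these quantities load-bearing, a simple per-iteration timing model helps. The sketch below is an assumption of this note, not a model taken from the paper: it treats one training iteration as simulation time plus transfer time plus update time, and optionally lets collection overlap with the update. The function names and the example numbers are hypothetical.

```python
def iteration_time(t_sim: float, t_transfer: float, t_update: float,
                   overlapped: bool) -> float:
    """Wall-clock time per training iteration under a two-stage model.

    Assumed model (not from the paper): collection costs t_sim + t_transfer,
    the learner costs t_update, and the two stages either run back to back
    or fully overlap.
    """
    collect = t_sim + t_transfer
    if overlapped:
        # Stages run concurrently, so the slower one sets the pace.
        return max(collect, t_update)
    # Synchronized cycle: collect first, then update.
    return collect + t_update


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    sync = iteration_time(60.0, 10.0, 50.0, overlapped=False)
    over = iteration_time(60.0, 10.0, 50.0, overlapped=True)
    print(sync, over, speedup(sync, over))
```

Under this model the end-to-end figure cannot be checked without all three inputs: a change in transfer latency or in the degree of overlap moves the result as much as a change in raw simulation throughput does.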

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and implementation details. We address each major comment below and will revise the manuscript accordingly to improve clarity and substantiation of the empirical claims.

Referee: [Abstract] The performance claims of 3-10× gains are stated without any experimental details, baselines, error bars, task specifications, or measurement methodology, making it impossible to assess whether the data supports the claim.

Authors: We agree that the abstract presents the 3-10× claims at a high level without sufficient context. In the revised manuscript, we will update the abstract to include brief experimental details: the tasks are standard MuJoCo-based robot control benchmarks, baselines include standard GPU-centric implementations, measurements use end-to-end wall-clock training time to convergence on identical hardware, and results include error bars from multiple seeds. The full methodology, tables, and figures remain in Section 4 (Experiments). This change will allow readers to assess the claims directly from the abstract.

revision: yes

Referee: [Implementation description] The CPU-batched physics backends are asserted to deliver the claimed throughput and fidelity, but no numbers are supplied for batched simulation steps/sec on the evaluated tasks, measured CPU-to-GPU transfer latency per rollout batch, or any ablation isolating synchronization overhead; these quantities are load-bearing for the end-to-end 3-10× claim.

Authors: The referee correctly identifies that the current manuscript describes the MuJoCoUni and MotrixSim backends at an architectural level but does not report the requested quantitative metrics. These measurements (batched steps/sec, per-batch transfer latency, and synchronization overhead ablation) are indeed essential to support the end-to-end efficiency claims. We will add a dedicated subsection with these numbers, derived from our evaluations on the representative tasks, along with the ablation study, in the revised version.

revision: yes
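The rebuttal promises wall-clock results with error bars from multiple seeds. One common way to report such a result is sketched below, assuming independent seeds and using first-order (delta-method) error propagation for the ratio of mean times. The helper names and the per-seed times are hypothetical and are not data from the paper.

```python
import math
import statistics


def summarize(times_s: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for per-seed times."""
    mean = statistics.fmean(times_s)
    sem = statistics.stdev(times_s) / math.sqrt(len(times_s))
    return mean, sem


def speedup_with_error(baseline_s: list[float],
                       candidate_s: list[float]) -> tuple[float, float]:
    """Ratio of mean wall-clock times with a propagated standard error.

    Assumes the two sets of runs are independent; the relative errors of
    the two means are combined in quadrature.
    """
    mean_b, sem_b = summarize(baseline_s)
    mean_c, sem_c = summarize(candidate_s)
    ratio = mean_b / mean_c
    rel_err = math.sqrt((sem_b / mean_b) ** 2 + (sem_c / mean_c) ** 2)
    return ratio, ratio * rel_err


if __name__ == "__main__":
    # Hypothetical per-seed training times in seconds, for illustration only.
    baseline = [100.0, 110.0, 120.0]
    candidate = [20.0, 22.0, 24.0]
    ratio, err = speedup_with_error(baseline, candidate)
    print(f"speedup = {ratio:.2f} +/- {err:.2f}")
```

With only a handful of seeds the standard error is itself noisy, so a reported range such as 3-10× is easier to assess when the per-task ratios and their seed counts are listed alongside it.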

Circularity Check

0 steps flagged

No circularity: empirical system architecture with observed speedups, no derivations or fitted predictions

The paper describes a heterogeneous CPU/GPU architecture (UniLab) implemented with specific backends (MuJoCoUni, MotrixSim) and reports measured end-to-end training speedups of 3-10× on robot control tasks. No equations, parameter fits, uniqueness theorems, or predictions are presented that could reduce to the inputs by construction. The central claims are framed as empirical results from running the new system, not as mathematical reductions or self-referential fits. Self-citations, if present, are not load-bearing for any derivation. This matches the default case of a non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the system builds on existing backends without introducing new postulated components.

pith-pipeline@v0.9.1-grok · 5986 in / 1091 out tokens · 24677 ms · 2026-06-29T07:06:58.014396+00:00 · methodology

Original abstract

Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption. Our view is that, in simulation-dominated robot control, the essential question is not which processor runs physics, but whether simulation throughput, policy learning, and runtime synchronization form an efficient end-to-end loop. We present UniLab, a heterogeneous CPU-simulation / GPU-learning architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. UniLab is implemented as a complete and extensible training system using MuJoCoUni and MotrixSim CPU-batched physics backends, supporting PPO, FastSAC, FlashSAC, and APPO. On representative simulation-based robot control tasks, UniLab improves end-to-end training efficiency by 3-10× under the same hardware configuration, while reducing dependence on the NVIDIA CUDA-based software stack and supporting cross-platform execution on the Apple macOS platform and the AMD ROCm and Intel XPU accelerator backends. These results show that GPU simulation is an effective path to efficient training, but not a necessary one, broadening the practical system choices available for robot RL training. Project page: https://unilabsim.github.io.

Figures

Figures reproduced from arXiv: 2605.30313 by Anqi Wang, Bokui Chen, Chenxin Dong, Chenyu Cao, Dixuan Jiang, Dongjie Zhu, Fan Jia, Guyue Zhou, Haizhou Ge, Hanqing Cui, Hanyang Shao, Haoyang Li, Heng Zhang, Jie Lu, Junzhe Wu, Kaifeng Zhang, Lei Han, Meng Li, Meng Zhang, Mingrui Yu, Ruqi Huang, Shengcheng Fu, Shenyu Chen, Tiancai Wang, Tianyue Zhan, Ting-Yu Tsui, Xiang Li, Xiang Zhao, Xiaofan Li, XiLin Fang, Xin Ma, Xiwa Deng, Ye He, Yixuan Li, Yiyang Liu, Yizhi Chen, Yizhou Jiang, Yue Gao, Yue Li, Yufei Jia, Yuhang Dong, Yusen Qin, Yuxiang Chen, Zhanxiang Cao, Zhenbiao Huang, Zheng Li, Zhengyang Wei, Zhixing Chen, Ziang Zheng, Zifan Wang, Ziheng Zhang.

**Figure 2.** UniLab system architecture. The figure shows the data, scheduling, and parameter-synchronization …

**Figure 3.** Collection–update timing and overlap. UniLab supports both synchronized and loosely coupled collection–update timing. Standard PPO uses a synchronized rollout/update cycle. Our APPO implementation follows the asynchronous on-policy formulation described by Luo et al. [33]: the collector writes fixed-horizon rollouts, behavior-policy log probabilities, and bootstrap i…

**Figure 4.** CPU simulation throughput across representative robot control scenes. The figure establishes the …

**Figure 5.** End-to-end training efficiency on representative robot control tasks. Representative speedups: 3.3 …

**Figure 6.** Training-cycle placement ablation. Holosoma is the FastSAC codebase used here, and MjWarp is its …

**Figure 7.** To-real experiment overview across six real-robot tasks.

**Figure 8.** Cross-platform training overview on representative devices. The figure shows training curves and …

**Figure 9.** Baseline SAC-A and optimized SAC learner-cycle timelines on A100. Durations are means per …

**Figure 10.** System-attribution summary for the optimized SAC trace. Panel A reports batching efficiency, …

**Figure 11.** C-to-baseline ablation for the SAC replay path. Wall-clock E2E bars are three-seed means with …

**Figure 12.** SAC buffer and communication overhead. Values in Panel A are means per retained learner …

**Figure 13.** PPO training curves across the sixteen benchmark tasks. Each panel reports episode reward against …

**Figure 14.** APPO training curves for the six tasks with a registered APPO configuration: Go1 / Go2 Joystick …

**Figure 15.** FastSAC and FlashSAC training curves for the five tasks with a replay configuration. G1 Walk Flat …
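The Figure 3 caption notes that the collector stores behavior-policy log probabilities with each rollout. A generic reason to store them is that a learner running ahead of the collector can reweight stale samples by an importance ratio. The sketch below shows a standard PPO-style clipped surrogate computed from stored log probabilities; it is an illustration under that assumption, not the APPO objective used in the paper and not the clipped-target-network objective of Luo et al. [33]. The function name and `clip_eps` default are hypothetical.

```python
import math


def clipped_surrogate(logp_new: list[float], logp_behavior: list[float],
                      advantages: list[float], clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate objective from stored behavior log-probs.

    Generic PPO-style clipping, shown only to illustrate why rollouts
    carry behavior-policy log probabilities.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        # Importance ratio pi_new(a|s) / pi_behavior(a|s).
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Identical policies give a ratio of 1, so the result is the mean advantage.
    print(clipped_surrogate([-1.0, -2.0], [-1.0, -2.0], [1.0, 3.0]))
```

When the current and behavior policies agree, the ratio is 1 and the objective reduces to the mean advantage; the clip only takes effect once the learner's policy has drifted from the one that collected the data.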

Reference graph

Works this paper leans on

34 extracted references · 14 canonical work pages · 6 internal anchors

[1] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
[2] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.
[3] K. Zakka, B. Tabanpour, Q. Liao, M. Haiderbhai, S. Holt, J. Y. Luo, A. Allshire, E. Frey, K. Sreenath, L. A. Kahrs, et al. MuJoCo Playground. arXiv preprint arXiv:2502.08844, 2025.
[4] K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel. mjlab: A Lightweight Framework for GPU-Accelerated Robot Learning. 2026. URL https://arxiv.org/abs/2601.22074.
[5] S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. kai Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V. N. Rajesh, Y. W. Choi, Y.-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. Robotics: Science and Systems, 2025.
[6] G. Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. URL https://github.com/Genesis-Embodied-AI/Genesis.
[7] J. Weng, M. Lin, S. Huang, B. Liu, D. Makoviichuk, V. Makoviychuk, Z. Liu, Y. Song, T. Luo, Y. Jiang, et al. EnvPool: A highly parallel reinforcement learning environment execution engine. Advances in Neural Information Processing Systems, 35:22409–22421, 2022.
[8] Z. Wu, E. Liang, M. Luo, S. Mika, J. E. Gonzalez, and I. Stoica. RLlib flow: Distributed reinforcement learning is a dataflow problem. In Conference on Neural Information Processing Systems (NeurIPS), 2021. URL https://proceedings.neurips.cc/paper/2021/file/2bce32ed409f5ebcee2a7b417ad9beed-Paper.pdf.
[9] J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, Y. Su, H. Su, and J. Zhu. Tianshou: A highly modularized deep reinforcement learning library. Journal of Machine Learning Research, 23(267):1–6, 2022. URL http://jmlr.org/papers/v23/21-1127.html.
[10] J. Suarez. PufferLib 2.0: Reinforcement learning at 1m steps/s. Reinforcement Learning Journal, 6:1378–1388, 2025.
[11] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[12] Y. Kim, H. Oh, J. Lee, J. Choi, G. Ji, M. Jung, D. Youm, and J. Hwangbo. Not only rewards but also constraints: Applications on legged robot locomotion. IEEE Transactions on Robotics, 40:2984–3003, 2024.
[13] O. Pearce. Exploring utilization options of heterogeneous architectures for multi-physics simulations. Parallel Computing, 87:35–45, 2019.
[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
[16] Y. Jia and J. Wu. MuJoCoUni: Persistent batched runtime primitives for MuJoCo. arXiv preprint arXiv:2605.24922, 2026.
[17] Motphys Team. MotrixSim: A physics simulation engine for robotics and embodied AI, 2026. URL https://motrixsim.readthedocs.io/. Python binary package.
[18] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax – a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281, 2021.
[19] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox. GPU-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018.
[20] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
[21] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
[22] G. B. Margolis and P. Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. In Conference on Robot Learning, pages 22–31. PMLR, 2023.
[23] G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. The International Journal of Robotics Research, 43(4):572–587, 2024.
[24] Z. Wang, Y. Jia, L. Shi, H. Wang, H. Zhao, X. Li, J. Zhou, J. Ma, and G. Zhou. Arm-constrained curriculum learning for loco-manipulation of a wheel-legged robot. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10770–10776. IEEE, 2024.
[25] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al. ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025.
[26] Z. Cao, L. Yan, Y. Zhang, S. Chen, J. Ma, T. Zhan, S. Fu, Y. Jia, C. Lu, and Y. Gao. Hiwet: Hierarchical world-frame end-effector tracking for long-horizon humanoid loco-manipulation. arXiv preprint arXiv:2602.06341, 2026.
[27] S. Bharthulwar, S. Tao, and H. Su. Staggered environment resets improve massively parallel on-policy reinforcement learning. Advances in Neural Information Processing Systems, 38:133342–133375, 2026.
[28] A. A. Shahid, Y. Narang, V. Petrone, E. Ferrentino, A. Handa, D. Fox, M. Pavone, and L. Roveda. Scaling population-based reinforcement learning with GPU accelerated simulation. arXiv preprint arXiv, 2404, 2024.
[29] S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
[30] Y. Seo, C. Sferrazza, H. Geng, M. Nauman, Z.-H. Yin, and P. Abbeel. FastTD3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint arXiv:2505.22642, 2025.
[31] Y. Seo, C. Sferrazza, J. Chen, G. Shi, R. Duan, and P. Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes. arXiv preprint arXiv:2512.01996, 2025.
[32] D. Kim, Y. Lee, M. Park, K. Kim, I. Nahendra, T. Seno, S. Min, D. Palenicek, F. Vogt, D. Kragic, et al. FlashSAC: Fast and stable off-policy reinforcement learning for high-dimensional robot control. arXiv preprint arXiv:2604.04539, 2026.
[33] M. Luo, J. Yao, R. Liaw, E. Liang, and I. Stoica. IMPACT: Importance weighted asynchronous architectures with clipped target networks. arXiv preprint arXiv:1912.00167, 2019.
[34] Google DeepMind. MuJoCo Warp (MJWarp), 2026. URL https://mujoco.readthedocs.io/en/3.3.7/mjwarp/. Software documentation.
