Recognition: 2 theorem links
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3
The pith
Reducing gradient updates while scaling models and data, plus bounding norms, lets off-policy RL match or beat PPO stability in high-dimensional robot control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashSAC modifies Soft Actor-Critic so that it performs far fewer gradient steps, uses larger networks, processes more environment steps per update, and applies hard bounds on weight, feature, and gradient norms; this prevents critic error accumulation while preserving the ability to learn from off-policy data distributions, yielding higher returns and faster convergence than PPO and prior off-policy baselines on over sixty tasks.
What carries the argument
The combination of reduced gradient-update frequency with increased model capacity and data throughput, stabilized by explicit bounds on weight, feature, and gradient norms that limit error propagation in the critic.
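As a concrete illustration of that machinery, the sketch below shows one generic way such bounds can sit around a critic update in PyTorch. It is a minimal reading of the recipe, not the paper's implementation: the network shape, the caps, and the helper names (BoundedCritic, project_weight_norms, critic_update) are assumptions made here for illustration.

```python
# Minimal sketch (not the paper's implementation): three generic ways to bound
# feature, gradient, and weight norms around a critic update. All names and
# cap values are hypothetical; FlashSAC's exact mechanisms may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundedCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=1024, feature_cap=10.0):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)
        self.feature_cap = feature_cap

    def forward(self, obs, act):
        feat = self.trunk(torch.cat([obs, act], dim=-1))
        # Feature-norm bound: rescale any feature vector whose L2 norm exceeds the cap.
        norm = feat.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        feat = feat * (self.feature_cap / norm).clamp(max=1.0)
        return self.head(feat)

def project_weight_norms(module, weight_cap=3.0):
    # Weight-norm bound: after the optimizer step, project each linear layer's
    # weight matrix back inside a Frobenius-norm ball.
    with torch.no_grad():
        for m in module.modules():
            if isinstance(m, nn.Linear):
                w_norm = m.weight.norm()
                if w_norm > weight_cap:
                    m.weight.mul_(weight_cap / w_norm)

def critic_update(critic, optimizer, obs, act, td_target, grad_cap=1.0):
    # td_target has shape [batch, 1], matching the critic output.
    q = critic(obs, act)
    loss = F.mse_loss(q, td_target)
    optimizer.zero_grad()
    loss.backward()
    # Gradient-norm bound: global clipping before the step.
    torch.nn.utils.clip_grad_norm_(critic.parameters(), grad_cap)
    optimizer.step()
    project_weight_norms(critic)
    return loss.item()
```

The point of the sketch is only where each bound acts: features before the Q-head, gradients before the optimizer step, and weights after it; the specific caps would be tuning choices.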
If this is right
- FlashSAC reaches higher final returns than PPO and strong off-policy baselines on the majority of tested tasks, with the biggest improvements on high-dimensional control problems such as dexterous manipulation.
- Training time for sim-to-real humanoid locomotion drops from hours to minutes.
- The method maintains stability across ten different simulators and more than sixty tasks without requiring task-specific hyper-parameter retuning.
- Off-policy data reuse becomes practical at scale because the reduced update count is offset by larger models and higher throughput.
Where Pith is reading between the lines
- The same scaling-plus-bounding recipe could be tested on other off-policy algorithms to see whether the stability gains are specific to SAC or more general.
- If the norm bounds prove robust, they might allow even larger models and lower update frequencies, further accelerating training in domains where data collection is expensive.
- The approach suggests that RL scaling laws can be made reliable once error accumulation is controlled, opening a route to apply supervised-learning style scaling directly to robot policies.
Load-bearing premise
That bounding weight, feature, and gradient norms is sufficient to stop critic errors from accumulating when the number of gradient updates is sharply reduced.
What would settle it
A high-dimensional dexterous manipulation task where FlashSAC either fails to reach higher final performance than PPO or shows clear signs of critic divergence despite the norm bounds and scaling changes.
Original abstract
Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
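To make the abstract's "sharply reduces gradient updates while compensating with larger models and higher data throughput" concrete, here is a small runnable arithmetic sketch of the update-to-data (UTD) ratio. All numbers are illustrative assumptions, not the paper's reported settings.

```python
# Illustrative arithmetic (hypothetical numbers, not the paper's settings):
# how cutting gradient updates while raising data throughput changes the
# update-to-data (UTD) ratio the bootstrapped critic must survive.

def utd_ratio(grad_updates_per_iter: int, parallel_envs: int, steps_per_env: int = 1) -> float:
    """Gradient updates per collected environment step."""
    return grad_updates_per_iter / (parallel_envs * steps_per_env)

# Classic single-env SAC: roughly one gradient update per environment step.
classic = utd_ratio(grad_updates_per_iter=1, parallel_envs=1)

# A FlashSAC-style regime (hypothetical): thousands of parallel envs,
# only a couple of large-batch updates per collection round.
scaled = utd_ratio(grad_updates_per_iter=2, parallel_envs=4096)

print(f"classic UTD ratio: {classic:.4f}")            # 1.0000
print(f"scaled UTD ratio:  {scaled:.6f}")             # 0.000488
print(f"reduction factor:  {classic / scaled:.0f}x")  # ~2048x
```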
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlashSAC, a variant of Soft Actor-Critic for off-policy RL in high-dimensional robot control. It reduces the frequency of gradient updates per environment step while scaling model capacity and data throughput, and adds explicit bounds on weight, feature, and gradient norms to curb critic error accumulation from bootstrapping. Empirical results across >60 tasks in 10 simulators show consistent outperformance versus PPO and strong off-policy baselines in final performance and sample efficiency, with largest gains on dexterous manipulation; a sim-to-real humanoid example reports training time reduced from hours to minutes.
Significance. If the reported gains can be isolated to the proposed stability mechanism rather than capacity or data-volume differences, the work would be significant for practical deployment of off-policy methods in robotics, where on-policy algorithms like PPO remain dominant due to perceived instability. The scaling-plus-norm-bounds approach offers a concrete, implementable recipe that could generalize beyond the tested simulators.
major comments (1)
- [Experiments] Experiments section (and associated tables/figures): the central claim that norm bounds plus reduced updates preserve off-policy advantages requires explicit confirmation that PPO and baseline SAC implementations used identical model sizes, network widths, and environment-step collection rates as FlashSAC. If baselines were run at standard (smaller) scales, performance gaps on high-dimensional dexterous tasks could be explained by capacity and data volume rather than the stability mechanism; this control is load-bearing for attributing gains to the proposed technique.
minor comments (2)
- [Abstract] Abstract and §3: the description of 'sharply reduces gradient updates' would benefit from a precise statement of the update-to-environment-step ratio used in FlashSAC versus baselines.
- [Ablation studies] The paper should include a dedicated ablation isolating the contribution of each norm bound (weight, feature, gradient) to stability, reported with the same metrics as the main results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying a key point needed to strengthen attribution of our results. We address the major comment below and commit to revisions that provide the requested controls and clarifications.
Point-by-point responses
Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that norm bounds plus reduced updates preserve off-policy advantages requires explicit confirmation that PPO and baseline SAC implementations used identical model sizes, network widths, and environment-step collection rates as FlashSAC. If baselines were run at standard (smaller) scales, performance gaps on high-dimensional dexterous tasks could be explained by capacity and data volume rather than the stability mechanism; this control is load-bearing for attributing gains to the proposed technique.
Authors: We agree that explicit side-by-side confirmation of scales is necessary for rigorous attribution. FlashSAC is intentionally designed around scaling (larger models, higher data throughput, fewer updates per step) plus norm bounds to stabilize off-policy learning at that scale; standard PPO and SAC baselines in our experiments follow their canonical implementations (e.g., from Stable Baselines3 and original papers), which use smaller widths (typically 256-512 hidden units, 2-3 layers) and lower throughput. The current manuscript and appendix already list per-method hyperparameters, but we will revise the Experiments section to add a consolidated table explicitly comparing model sizes, widths, layers, and environment steps per update across all methods. To isolate the stability mechanisms from raw capacity, we will also add an ablation comparing capacity-matched SAC (identical model size and throughput to FlashSAC) with and without norm bounds. These additions will be included in the revised manuscript and will directly address the load-bearing concern about whether gains stem from the proposed technique rather than scale alone.
revision: yes
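A hypothetical sketch of the promised ablation matrix, in which every arm shares the same model size and throughput so that only the stabilizers vary; the field names and values below are illustrative, not taken from the paper.

```python
# Hypothetical ablation grid (illustrative only): capacity and throughput are held
# fixed across arms, so any performance gap is attributable to the norm bounds.
from dataclasses import dataclass
from itertools import product

@dataclass
class Arm:
    name: str
    hidden_width: int      # identical across arms: capacity is controlled
    parallel_envs: int     # identical across arms: throughput is controlled
    weight_norm_bound: bool
    feature_norm_bound: bool
    grad_norm_bound: bool

base = dict(hidden_width=2048, parallel_envs=4096)

arms = [
    Arm("capacity-matched SAC, no bounds", **base,
        weight_norm_bound=False, feature_norm_bound=False, grad_norm_bound=False),
    Arm("FlashSAC-style, all bounds", **base,
        weight_norm_bound=True, feature_norm_bound=True, grad_norm_bound=True),
]

# Single-bound arms isolate each stabilizer's contribution (the second minor comment).
for w, f, g in product([True, False], repeat=3):
    if sum([w, f, g]) == 1:
        bound = "weight" if w else ("feature" if f else "gradient")
        arms.append(Arm(f"only {bound} bound", **base,
                        weight_norm_bound=w, feature_norm_bound=f, grad_norm_bound=g))

for arm in arms:
    print(arm)
```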
Circularity Check
No circularity in empirical algorithm presentation
Full rationale
The paper introduces FlashSAC as a practical modification to Soft Actor-Critic, motivated by external scaling laws from supervised learning and stabilized via explicit norm bounds on weights, features, and gradients. All performance claims rest on direct empirical comparisons across 60+ tasks rather than any closed-form derivation, fitted parameter renamed as prediction, or self-referential equation. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or method description. The central contribution is an engineering recipe validated by external benchmarks, making the result self-contained and independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Scaling laws from supervised learning apply directly to off-policy RL value-function fitting when gradient updates are reduced.
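For reference, the supervised scaling-law form this assumption imports is typically written as below; the RL-side analogue is a reading of the ledger entry, not a formula stated by the paper.

```latex
% Standard supervised scaling-law form: loss as a power law in model size N and data D.
\[
  L(N, D) \;\approx\; \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D}
\]
% The ledger's assumption is that critic fitting error behaves analogously: with fewer
% gradient updates, error can still fall as N and the volume of off-policy data D grow,
% provided the norm bounds keep bootstrapped errors from compounding.
```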
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "sharply reduces gradient updates while compensating with larger models and higher data throughput... explicitly bounds weight, feature, and gradient norms"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Inverted Residual Backbone... Pre-activation Batch Normalization... Weight Normalization"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.