pith. machine review for the scientific record.

arxiv: 2604.04539 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.RO

Recognition: 2 theorem links · Lean Theorem

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords reinforcement learning · off-policy learning · robot control · Soft Actor-Critic · high-dimensional control · stability · sim-to-real transfer · scaling laws

The pith

Reducing gradient updates while scaling models and data, plus bounding norms, lets off-policy RL match or beat PPO stability in high-dimensional robot control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that off-policy reinforcement learning has long been limited in high-dimensional robot tasks because fitting value functions over diverse data requires many updates, allowing critic errors to build up through bootstrapping. FlashSAC addresses this by following scaling patterns from supervised learning: it sharply cuts the number of gradient updates, compensates with bigger models and faster data collection, and adds explicit bounds on weight, feature, and gradient norms to keep errors in check. A reader would care because this combination promises the sample-efficiency advantages of off-policy methods without the usual instability, leading to better final performance and much shorter training times across many simulators and real-robot transfer scenarios.

Core claim

FlashSAC modifies Soft Actor-Critic so that it performs far fewer gradient steps, uses larger networks, processes more environment steps per update, and applies hard bounds on weight, feature, and gradient norms; this prevents critic error accumulation while preserving the ability to learn from off-policy data distributions, yielding higher returns and faster convergence than PPO and prior off-policy baselines on over sixty tasks.
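
As a rough illustration of how those three knobs relate, here is a hypothetical configuration sketch; the concrete numbers are placeholders for illustration, not settings reported in the paper.

```python
# Hypothetical sketch of the scaling knobs named in the core claim; the concrete
# values are illustrative assumptions, not the paper's reported configuration.
from dataclasses import dataclass

@dataclass
class RunConfig:
    num_envs: int                 # parallel simulated environments (data throughput)
    hidden_width: int             # actor/critic network width (model capacity)
    updates_per_env_step: float   # gradient updates per collected environment step

standard_sac = RunConfig(num_envs=1, hidden_width=256, updates_per_env_step=1.0)
flashsac_like = RunConfig(num_envs=4096, hidden_width=2048, updates_per_env_step=1 / 32)

# Fewer updates per environment step, but far more environment steps per second and
# a larger model, so each gradient update sees a broader slice of off-policy data.
```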

What carries the argument

The combination of reduced gradient-update frequency with increased model capacity and data throughput, stabilized by explicit bounds on weight, feature, and gradient norms that limit error propagation in the critic.
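
A minimal PyTorch sketch of what such a bounded critic update could look like, assuming layer-normalized hidden layers, a unit-norm feature bound, per-row weight renormalization after each optimizer step, and gradient-norm clipping. The bound values, network width, and exact placement of the bounds are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the stabilization recipe described above (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_GRAD_NORM = 1.0      # gradient-norm bound (assumed value)
MAX_WEIGHT_NORM = 10.0   # per-row weight-norm bound (assumed value)

class BoundedCritic(nn.Module):
    """Large MLP critic whose features are kept on a bounded scale."""
    def __init__(self, obs_dim, act_dim, width=2048):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, width), nn.LayerNorm(width), nn.ReLU(),
            nn.Linear(width, width), nn.LayerNorm(width), nn.ReLU(),
        )
        self.head = nn.Linear(width, 1)

    def forward(self, obs, act):
        feat = self.body(torch.cat([obs, act], dim=-1))
        feat = F.normalize(feat, dim=-1)          # feature-norm bound: unit-norm features
        return self.head(feat)

def bound_weight_norms(model, max_norm=MAX_WEIGHT_NORM):
    """Project each weight matrix back inside a norm ball after the optimizer step."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Linear):
                m.weight.copy_(torch.renorm(m.weight, p=2, dim=0, maxnorm=max_norm))

def critic_update(critic, optimizer, batch, td_target):
    q = critic(batch["obs"], batch["act"])
    loss = F.mse_loss(q, td_target)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(critic.parameters(), MAX_GRAD_NORM)  # gradient-norm bound
    optimizer.step()
    bound_weight_norms(critic)                                    # weight-norm bound
    return loss.item()
```

The layer normalization and feature normalization keep the critic's representations on a fixed scale, while the renorm projection and gradient clipping cap how far any single update can move the function; together these are one plausible reading of "bounds on weight, feature, and gradient norms."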

If this is right

  • FlashSAC reaches higher final returns than PPO and strong off-policy baselines on the majority of tested tasks, with the biggest improvements on high-dimensional control problems such as dexterous manipulation.
  • Training time for sim-to-real humanoid locomotion drops from hours to minutes.
  • The method maintains stability across ten different simulators and more than sixty tasks without requiring task-specific hyper-parameter retuning.
  • Off-policy data reuse becomes practical at scale because the reduced update count is offset by larger models and higher throughput.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling-plus-bounding recipe could be tested on other off-policy algorithms to see whether the stability gains are specific to SAC or more general.
  • If the norm bounds prove robust, they might allow even larger models and lower update frequencies, further accelerating training in domains where data collection is expensive.
  • The approach suggests that RL scaling laws can be made reliable once error accumulation is controlled, opening a route to apply supervised-learning style scaling directly to robot policies.

Load-bearing premise

That bounding weight, feature, and gradient norms is sufficient to stop critic errors from accumulating when the number of gradient updates is sharply reduced.
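
For intuition only (a textbook approximate value iteration bound, not an argument the paper makes): if each critic fit before the next bootstrap introduces at most δ of sup-norm error, the accumulated error stays bounded rather than diverging, which is the behaviour the norm bounds are meant to enforce.

```latex
% Approximate value iteration with bounded per-step error:
%   Q_{k+1} = \mathcal{T}^{*} Q_k + e_k, \qquad \|e_k\|_{\infty} \le \delta,
% where \mathcal{T}^{*} is the \gamma-contractive Bellman optimality operator. Then
\limsup_{k \to \infty} \, \| Q_k - Q^{*} \|_{\infty} \;\le\; \frac{\delta}{1 - \gamma}.
```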

What would settle it

A high-dimensional dexterous manipulation task where FlashSAC either fails to reach higher final performance than PPO or shows clear signs of critic divergence despite the norm bounds and scaling changes.

read the original abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FlashSAC, a variant of Soft Actor-Critic for off-policy RL in high-dimensional robot control. It reduces the frequency of gradient updates per environment step while scaling model capacity and data throughput, and adds explicit bounds on weight, feature, and gradient norms to curb critic error accumulation from bootstrapping. Empirical results across >60 tasks in 10 simulators show consistent outperformance versus PPO and strong off-policy baselines in final performance and sample efficiency, with largest gains on dexterous manipulation; a sim-to-real humanoid example reports training time reduced from hours to minutes.

Significance. If the reported gains can be isolated to the proposed stability mechanism rather than capacity or data-volume differences, the work would be significant for practical deployment of off-policy methods in robotics, where on-policy algorithms like PPO remain dominant due to perceived instability. The scaling-plus-norm-bounds approach offers a concrete, implementable recipe that could generalize beyond the tested simulators.

major comments (1)
  1. [Experiments] Experiments section (and associated tables/figures): the central claim that norm bounds plus reduced updates preserve off-policy advantages requires explicit confirmation that PPO and baseline SAC implementations used identical model sizes, network widths, and environment-step collection rates as FlashSAC. If baselines were run at standard (smaller) scales, performance gaps on high-dimensional dexterous tasks could be explained by capacity and data volume rather than the stability mechanism; this control is load-bearing for attributing gains to the proposed technique.
minor comments (2)
  1. [Abstract] Abstract and §3: the description of 'sharply reduces gradient updates' would benefit from a precise statement of the update-to-environment-step ratio used in FlashSAC versus baselines.
  2. [Ablation studies] The paper should include a dedicated ablation isolating the contribution of each norm bound (weight, feature, gradient) to stability, reported with the same metrics as the main results; a sketch of such an ablation grid follows this list.
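
Both minor points could be addressed with a small, explicit experiment grid. A hedged sketch of what that grid might look like, using hypothetical update-to-step ratios and bound toggles rather than values from the paper:

```python
# Hypothetical ablation grid over the update-to-environment-step ratio and the
# three norm bounds; every value here is a placeholder, not a reported setting.
from itertools import product

update_to_step_ratios = [1 / 8, 1 / 32]     # gradient updates per environment step (assumed)
norm_bound_variants = {
    "all_bounds": dict(weight=True,  feature=True,  grad=True),
    "no_weight":  dict(weight=False, feature=True,  grad=True),
    "no_feature": dict(weight=True,  feature=False, grad=True),
    "no_grad":    dict(weight=True,  feature=True,  grad=False),
}

runs = [
    {"utd_ratio": utd, "variant": name, **bounds}
    for utd, (name, bounds) in product(update_to_step_ratios, norm_bound_variants.items())
]
for run in runs:
    print(run)  # each configuration would be trained with otherwise identical hyper-parameters
```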

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying a key point needed to strengthen attribution of our results. We address the major comment below and commit to revisions that provide the requested controls and clarifications.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that norm bounds plus reduced updates preserve off-policy advantages requires explicit confirmation that PPO and baseline SAC implementations used identical model sizes, network widths, and environment-step collection rates as FlashSAC. If baselines were run at standard (smaller) scales, performance gaps on high-dimensional dexterous tasks could be explained by capacity and data volume rather than the stability mechanism; this control is load-bearing for attributing gains to the proposed technique.

    Authors: We agree that explicit side-by-side confirmation of scales is necessary for rigorous attribution. FlashSAC is intentionally designed around scaling (larger models, higher data throughput, fewer updates per step) plus norm bounds to stabilize off-policy learning at that scale; standard PPO and SAC baselines in our experiments follow their canonical implementations (e.g., from Stable Baselines3 and original papers), which use smaller widths (typically 256-512 hidden units, 2-3 layers) and lower throughput. The current manuscript and appendix already list per-method hyperparameters, but we will revise the Experiments section to add a consolidated table explicitly comparing model sizes, widths, layers, and environment steps per update across all methods. To isolate the stability mechanisms from raw capacity, we will also add an ablation comparing capacity-matched SAC (identical model size and throughput to FlashSAC) with and without norm bounds. These additions will be included in the revised manuscript and will directly address the load-bearing concern about whether gains stem from the proposed technique rather than scale alone. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical algorithm presentation

full rationale

The paper introduces FlashSAC as a practical modification to Soft Actor-Critic, motivated by external scaling laws from supervised learning and stabilized via explicit norm bounds on weights, features, and gradients. All performance claims rest on direct empirical comparisons across 60+ tasks rather than any closed-form derivation, fitted parameter renamed as prediction, or self-referential equation. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or method description. The central contribution is an engineering recipe validated by external benchmarks, making the result self-contained and independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract only: the approach assumes that scaling laws observed in supervised learning transfer to RL critic training, and that norm bounding suffices to control bootstrapping error accumulation, without further justification or external validation.

axioms (1)
  • domain assumption Scaling laws from supervised learning apply directly to off-policy RL value function fitting when gradient updates are reduced.
    Motivation stated in abstract for reducing updates while increasing model size and data throughput.

pith-pipeline@v0.9.0 · 5574 in / 1296 out tokens · 51175 ms · 2026-05-10T19:59:18.599888+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 50 canonical work pages · 12 internal anchors

  1. [1]

    Loss of plasticity in continual deep reinforcement learning

    Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, and Marlos C Machado. Loss of plasticity in continual deep reinforcement learning. InConference on lifelong learning agents, pages 620–636. PMLR, 2023

  2. [2]

    Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

  3. [3]

    A brief survey of deep reinforcement learning,

    Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning.arXiv preprint arXiv:1708.05866, 2017

  4. [4]

    Genesis: A generative and universal physics engine for robotics and beyond, December 2024

    Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. https://github.com/Genesis-Embodied-AI/Genesis

  5. [5]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  6. [6]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  7. [7]

    A distributional perspective on reinforcement learning

    Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. PMLR, 2017

  8. [8]

    CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity

    Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. International Conference on Learning Representations (ICLR), 2024

  9. [9]

    Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

    Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

  10. [10]

    Towards human-level bimanual dexterous manipulation with reinforcement learning

    Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with reinforcement learning. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.https://openreview.net/...

  11. [11]

Temporally-extended ε-greedy exploration

Will Dabney, Georg Ostrovski, and André Barreto. Temporally-extended ε-greedy exploration. arXiv preprint arXiv:2006.01782, 2020

  12. [12]

Maintaining plasticity in deep continual learning

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Parash Rahman, Richard S Sutton, and A Rupam Mahmood. Maintaining plasticity in deep continual learning.arXiv preprint arXiv:2306.13812, 2023

  13. [13]

    Pink noise is all you need: Colored noise exploration in deep reinforcement learning

    Onno Eberhard, Jakob Hollenstein, Cristina Pinneri, and Georg Martius. Pink noise is all you need: Colored noise exploration in deep reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

  14. [14]

    Revisiting fundamentals of experience replay

    William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InInternational conference on machine learning, pages 3061–3071. PMLR, 2020

  15. [15]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

  16. [16]

For SALE: State-action representation learning for deep reinforcement learning

    Scott Fujimoto, Wei-Di Chang, Edward J Smith, Shixiang Shane Gu, Doina Precup, and David Meger. For sale: State-action representation learning for deep reinforcement learning.arXiv preprint arXiv:2306.02451, 2023

  17. [17]

    Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

    Scott Fujimoto, Pierluca D’Oro, Amy Zhang, Yuandong Tian, and Michael Rabbat. Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

  18. [18]

    Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811, 2024

    Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811, 2024

  19. [19]

    Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning,

    Zhaoyuan Gu, Junheng Li, Wenlan Shen, Wenhao Yu, Zhaoming Xie, Stephen McCrory, Xianyi Cheng, Abdulaziz Shamsah, Robert Griffin, C Karen Liu, et al. Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning.arXiv preprint arXiv:2501.02116, 2025

  20. [20]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  21. [21]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  22. [22]

    Td-mpc2: Scalable, robust world models for continuous control, 2024

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control, 2024

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  24. [24]

    Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

    David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

  25. [25]

    Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

    Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, and Justus Piater. Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

  26. [26]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017

  27. [27]

    Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

    Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

  28. [28]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational conference on machine learning, pages 448–456. pmlr, 2015

  29. [29]

    When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

  30. [30]

    Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022

    Gwanghyeon Ji, Juhyeok Mun, Hyeongjun Kim, and Jemin Hwangbo. Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022. ISSN 2377-3774. doi: 10.1109/lra.2022.3151396.http://dx.doi.org/10.1109/LRA.2022.3151396

  31. [31]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923–16930. IEEE, 2025

  32. [32]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  33. [33]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InCoRL, 2025

  34. [34]

    Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

    Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

  35. [35]

    Rma: Rapid motor adaptation for legged robots,

    Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021

  36. [36]

    Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

    Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

  37. [37]

    Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

    Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

  38. [38]

    Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks,

    Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks.arXiv preprint arXiv:2406.02596, 2024

  39. [39]

    Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

    Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

  40. [40]

    Hyperspherical normalization for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280, 2025

Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normalization for scalable deep reinforcement learning. arXiv preprint arXiv:2502.15280, 2025

  41. [41]

    Learning quadrupedal locomotion over challenging terrain,

    Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain.Science Robotics, 5(47), October 2020. ISSN 2470-9476. doi: 10.1126/ scirobotics.abc5986.http://dx.doi.org/10.1126/scirobotics.abc5986

  42. [42]

    Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

    Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

  43. [43]

Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion

    Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

  44. [44]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

  45. [45]

    Softgym: Benchmarking deep reinforcement learning for deformable object manipulation

    Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. InConference on Robot Learning, pages 432–448. PMLR, 2021

  46. [46]

nGPT: Normalized transformer with representation learning on the hypersphere

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

  47. [47]

Understanding plasticity in neural networks

    Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks.Proc. the International Conference on Machine Learning (ICML), 2023

  48. [48]

    Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

  49. [49]

Model-based reinforcement learning: A survey

    Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey.Foundations and Trends in Machine Learning, 16(1):1–118, 2023

  50. [50]

    Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

    A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

  51. [51]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  52. [52]

    Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017

  53. [53]

    Symmetry considerations for learning task symmetric robot policies

    Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter. Symmetry considerations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024

  54. [54]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

  55. [55]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  56. [56]

A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot

    J.W. Mock and University of Wyoming. Department of Electrical Engineering.A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot. University of Wyoming, 2023. ISBN 9798379561789.https://books.google.co.kr/books?id=waUG0AEACAAJ

  57. [57]

    Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning,

    I Nahrendra, Byeongho Yu, and Hyun Myung. Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning.arXiv preprint arXiv:2301.10602, 2023

  58. [58]

    Reward centering.arXiv preprint arXiv:2405.09999, 2024

    Abhishek Naik, Yi Wan, Manan Tomar, and Richard S Sutton. Reward centering.arXiv preprint arXiv:2405.09999, 2024

  59. [59]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems, 2024

  60. [60]

    Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control.arXiv preprint arXiv:2405.16158, 2024

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control. arXiv preprint arXiv:2405.16158, 2024

  61. [61]

Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners

    Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

  62. [62]

    Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

    Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

  63. [63]

    Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

  64. [64]

    XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

    Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

  65. [65]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  66. [66]

    Asymmetric Actor Critic for Image-Based Robot Learning

    Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning.arXiv preprint arXiv:1710.06542, 2017

  67. [67]

Habitat 3.0: A co-habitat for humans, avatars and robots

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

  68. [69]

    Learning to walk in minutes using massively parallel deep reinforcement learning

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. InConference on robot learning, pages 91–100. PMLR, 2022

  69. [70]

    How does batch normalization help optimization?Advances in neural information processing systems, 31, 2018

    Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?Advances in neural information processing systems, 31, 2018

  70. [71]

Return-based scaling: Yet another normalisation trick for deep RL

    Tom Schaul, Georg Ostrovski, Iurii Kemaev, and Diana Borsa. Return-based scaling: Yet another normalisation trick for deep rl.arXiv preprint arXiv:2105.05347, 2021

  71. [72]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  72. [73]

RSL-RL: A learning library for robotics research

    Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

  73. [74]

    Learning sim-to-real humanoid locomotion in 15 minutes.arXiv preprint arXiv:2512.01996, 2025

    Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025.https://arxiv.org/abs/2512.01996

  74. [75]

    Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

    Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

  75. [76]

HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation

    Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation.arXiv preprint arXiv:2403.10506, 2024

  76. [77]

    Importance sampling for reinforcement learning with multiple objectives.PhD thesis, Massachusetts Institute of Technology, 2001

    Christian Robert Shelton. Importance sampling for reinforcement learning with multiple objectives.PhD thesis, Massachusetts Institute of Technology, 2001

  77. [78]

    Sim2real manipulation on unknown objects with tactile-based reinforcement learning

    Entong Su, Chengzhe Jia, Yuzhe Qin, Wenxuan Zhou, Annabella Macaluso, Binghao Huang, and Xiaolong Wang. Sim2real manipulation on unknown objects with tactile-based reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9234–9241. IEEE, 2024

  78. [79]

    Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

    Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. InMachine learning proceedings 1990, pages 216–224. Elsevier, 1990

  79. [80]

Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  80. [81]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

Showing first 80 references.