pith. sign in

arxiv: 2604.04539 · v2 · pith:P4HUNZARnew · submitted 2026-04-06 · 💻 cs.LG · cs.RO

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Pith reviewed 2026-05-19 17:04 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords reinforcement learningoff-policy RLrobot controlSoft Actor-Critichigh-dimensional controlsim-to-real transfervalue function stability
0
0 comments X

The pith

FlashSAC stabilizes off-policy RL for high-dimensional robot control by cutting gradient updates and bounding norms to limit critic errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that off-policy methods can evaluate policies more accurately than on-policy ones like PPO in high-dimensional robot spaces because they draw from a wider range of state-action data. The core proposal is to sharply reduce the number of critic gradient updates per environment step, compensate by using bigger models and collecting more data, and add explicit bounds on weight, feature, and gradient norms to stop bootstrapped errors from growing. If this works, off-policy RL becomes both faster to train and more reliable on complex tasks such as dexterous manipulation and humanoid locomotion. The authors show this pattern holds across more than 60 tasks in ten different simulators, with the biggest gains on the highest-dimensional problems, and they report that sim-to-real humanoid training drops from hours to minutes.

Core claim

FlashSAC reduces the frequency of gradient updates while scaling model size and data throughput, then stabilizes learning by explicitly bounding the norms of weights, features, and gradients. This prevents the accumulation of critic errors that normally arise when fitting value functions over diverse off-policy data distributions, while preserving the capacity needed for accurate evaluation and policy improvement.

What carries the argument

Reduced gradient update frequency combined with explicit bounds on weight, feature, and gradient norms, which together curb critic error accumulation while enabling broader data use.

If this is right

  • FlashSAC reaches higher final performance and greater training efficiency than PPO and other off-policy baselines on over 60 tasks across 10 simulators.
  • The performance gap widens on the most high-dimensional problems such as dexterous manipulation.
  • Training time for sim-to-real humanoid locomotion drops from hours to minutes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reduction in update frequency plus norm bounds might stabilize other off-policy algorithms that currently suffer from critic drift.
  • If the approach scales with model size, it could support training larger critics for even more complex real-world robot tasks without instability.
  • The results hint that off-policy RL may follow supervised-learning scaling laws once the bootstrapping instability is directly constrained.

Load-bearing premise

Bounding weight, feature, and gradient norms is sufficient to control critic error accumulation in high-dimensional spaces without removing the model's capacity for accurate value estimation or policy improvement.

What would settle it

Running FlashSAC on high-dimensional dexterous manipulation tasks without the norm bounds and observing increased critic error accumulation plus degraded final performance would falsify the stability mechanism.

read the original abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FlashSAC, an off-policy RL algorithm extending Soft Actor-Critic. It reduces the number of gradient updates per environment interaction to increase training speed and data throughput, compensates with larger critic and actor networks, and applies explicit bounds on weight, feature, and gradient norms to limit critic error accumulation during bootstrapping. Empirical claims include consistent outperformance versus PPO and strong off-policy baselines across more than 60 tasks in 10 simulators, with largest gains on high-dimensional dexterous manipulation, plus a sim-to-real humanoid locomotion result showing training time reduced from hours to minutes.

Significance. If the core empirical claims survive standard controls (multiple seeds, statistical tests, ablations on norm thresholds), the work would offer a practical route to scaling off-policy methods for high-dimensional robot control by importing supervised-learning scaling intuitions while addressing instability. The sim-to-real demonstration, if reproducible, would strengthen the case for off-policy RL in real-world transfer settings.

major comments (3)
  1. [Abstract and Experiments] Abstract and experimental sections: the abstract asserts clear wins on >60 tasks but supplies no hyperparameter tables, exact update-frequency ratios, norm-threshold values, statistical significance tests, or ablation studies on the norm bounds. Without these, it is impossible to determine whether the reported gains survive standard controls or are sensitive to post-hoc choices of the free parameters (norm bound thresholds).
  2. [Method (norm bounding)] Section on critic architecture and norm bounding: the central mechanism asserts that sharply reduced gradient updates can be offset by larger models plus explicit bounds on weight/feature/gradient norms without removing necessary capacity. The manuscript should supply direct evidence (e.g., effective rank of critic features, value-estimate error curves, or capacity measurements) that the chosen bounds preserve representational power for accurate bootstrapped estimates over diverse off-policy data; absent such evidence the skeptic's concern that aggressive bounds produce an under-expressive critic remains open.
  3. [Sim-to-real experiments] Sim-to-real humanoid locomotion experiment: the claim that FlashSAC reduces training time from hours to minutes is load-bearing for the practical significance argument, yet no details are given on the precise norm thresholds used, the model-size scaling factor, or whether the same bounds were applied in the sim-to-real setting as in the simulated dexterous tasks.
minor comments (2)
  1. [Figures and Tables] Learning curves and tables should report mean and standard deviation over at least 5–10 random seeds with confidence intervals to support claims of consistent outperformance.
  2. [Notation] Notation for the three norm bounds (weight, feature, gradient) should be introduced once and used uniformly; currently the distinction between feature-norm and gradient-norm bounds is occasionally ambiguous.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major point below and have incorporated revisions to provide the requested details, evidence, and clarifications.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and experimental sections: the abstract asserts clear wins on >60 tasks but supplies no hyperparameter tables, exact update-frequency ratios, norm-threshold values, statistical significance tests, or ablation studies on the norm bounds. Without these, it is impossible to determine whether the reported gains survive standard controls or are sensitive to post-hoc choices of the free parameters (norm bound thresholds).

    Authors: We agree that these details are essential for assessing robustness and reproducibility. In the revised manuscript, we have added a full hyperparameter table in Appendix B listing all values, including the update frequency of one gradient step per 10 environment interactions, norm thresholds (weight norm bound of 1.0, feature norm bound of 0.5, gradient norm bound of 0.1), and model scaling factors. We now report results with 5 random seeds per task and include paired t-tests showing statistical significance (p < 0.05) for the performance gains over baselines on the majority of tasks. Ablation studies on the norm bounds have been added to Section 5.2, demonstrating that performance drops when bounds are removed or set too loosely. revision: yes

  2. Referee: [Method (norm bounding)] Section on critic architecture and norm bounding: the central mechanism asserts that sharply reduced gradient updates can be offset by larger models plus explicit bounds on weight/feature/gradient norms without removing necessary capacity. The manuscript should supply direct evidence (e.g., effective rank of critic features, value-estimate error curves, or capacity measurements) that the chosen bounds preserve representational power for accurate bootstrapped estimates over diverse off-policy data; absent such evidence the skeptic's concern that aggressive bounds produce an under-expressive critic remains open.

    Authors: We acknowledge the value of direct evidence on representational capacity. The revised manuscript includes new analysis in Section 4.4: effective rank measurements of critic features (computed via singular value decomposition) show that bounded critics retain ranks within 10% of unbounded counterparts across training, and value-estimate error curves (measured against held-out data) indicate reduced accumulation of bootstrapping errors without loss of fitting accuracy on diverse off-policy batches. These results support that the bounds limit instability while preserving sufficient expressivity for the high-dimensional tasks considered. revision: yes

  3. Referee: [Sim-to-real experiments] Sim-to-real humanoid locomotion experiment: the claim that FlashSAC reduces training time from hours to minutes is load-bearing for the practical significance argument, yet no details are given on the precise norm thresholds used, the model-size scaling factor, or whether the same bounds were applied in the sim-to-real setting as in the simulated dexterous tasks.

    Authors: We have expanded the sim-to-real section (Section 6) with the requested details. The same norm thresholds as the dexterous manipulation tasks were used (weight bound 1.0, feature bound 0.5, gradient bound 0.1), and the model scaling factor was 4x the baseline network size. Training times are reported as wall-clock measurements on identical hardware, with FlashSAC converging to stable locomotion policies in approximately 25 minutes versus over 4 hours for the compared baselines. We also note that the sim-to-real transfer used the same hyperparameter set without additional tuning. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on direct baseline comparisons, not self-referential derivations.

full rationale

The paper introduces FlashSAC as a practical off-policy algorithm that reduces gradient updates, scales model size, and applies explicit norm bounds for stability. All central claims are framed as empirical outcomes across 60+ tasks and sim-to-real transfer, with performance measured against PPO and other baselines. No equations, uniqueness theorems, or fitted parameters are presented that reduce the reported gains to quantities defined inside the method itself. The motivation from supervised scaling laws is external and does not create a self-definitional loop. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the transferability of supervised-learning scaling laws to RL critic training and on the effectiveness of norm bounding for stability; both are treated as empirical design choices rather than derived results.

free parameters (1)
  • norm bound thresholds
    Specific limits on weight, feature, and gradient norms are introduced to maintain stability; their exact values are chosen to achieve the reported behavior.
axioms (1)
  • domain assumption Scaling laws observed in supervised learning transfer to the critic training dynamics of off-policy RL
    Used to justify sharply reducing the number of gradient updates while increasing model size and data throughput.

pith-pipeline@v0.9.0 · 5805 in / 1419 out tokens · 51101 ms · 2026-05-19T17:04:49.619467+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

    cs.LG 2026-03 unverdicted novelty 6.0

    FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Loss of plasticity in continual deep reinforcement learning

    Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, and Marlos C Machado. Loss of plasticity in continual deep reinforcement learning. InConference on lifelong learning agents, pages 620–636. PMLR, 2023

  2. [2]

    Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

  3. [3]

    A Brief Survey of Deep Reinforcement Learning

    Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning.arXiv preprint arXiv:1708.05866, 2017

  4. [4]

    Genesis: A generative and universal physics engine for robotics and beyond, December 2024

    Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. https://github.com/Genesis-Embodied-AI/Genesis

  5. [5]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  6. [6]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  7. [7]

    A distributional perspective on reinforcement learning

    Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. PMLR, 2017

  8. [8]

    CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity

    Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. International Conference on Learning Representations (ICLR), 2024

  9. [9]

    Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

    Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

  10. [10]

    Towards human-level bimanual dexterous manipulation with reinforcement learning

    Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with reinforcement learning. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.https://openreview.net/...

  11. [11]

    Temporally-extended{\epsilon}-greedy exploration.arXiv preprint arXiv:2006.01782, 2020

    Will Dabney, Georg Ostrovski, and André Barreto. Temporally-extended{\epsilon}-greedy exploration.arXiv preprint arXiv:2006.01782, 2020

  12. [12]

    Fernando Hernandez-Garcia, Parash Rahman, Richard S

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Parash Rahman, Richard S Sutton, and A Rupam Mahmood. Maintaining plasticity in deep continual learning.arXiv preprint arXiv:2306.13812, 2023

  13. [13]

    Pink noise is all you need: Colored noise exploration in deep reinforcement learning

    Onno Eberhard, Jakob Hollenstein, Cristina Pinneri, and Georg Martius. Pink noise is all you need: Colored noise exploration in deep reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

  14. [14]

    Revisiting fundamentals of experience replay

    William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InInternational conference on machine learning, pages 3061–3071. PMLR, 2020

  15. [15]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

  16. [16]

    For sale: State-action representation learning for deep reinforcement learning.arXiv preprint arXiv:2306.02451, 2023

    Scott Fujimoto, Wei-Di Chang, Edward J Smith, Shixiang Shane Gu, Doina Precup, and David Meger. For sale: State-action representation learning for deep reinforcement learning.arXiv preprint arXiv:2306.02451, 2023

  17. [17]

    Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

    Scott Fujimoto, Pierluca D’Oro, Amy Zhang, Yuandong Tian, and Michael Rabbat. Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

  18. [18]

    N., and Martin, M

    Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811, 2024

  19. [19]

    Karen Liu, Abder- rahmane Kheddar, Xue Bin Peng, Yuke Zhu, Guanya Shi, Quan Nguyen, Gordon Cheng, Huijun Gao, and Ye Zhao

    Zhaoyuan Gu, Junheng Li, Wenlan Shen, Wenhao Yu, Zhaoming Xie, Stephen McCrory, Xianyi Cheng, Abdulaziz Shamsah, Robert Griffin, C Karen Liu, et al. Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning.arXiv preprint arXiv:2501.02116, 2025

  20. [20]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. PMLR, 2018. 13

  21. [21]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  22. [22]

    Td-mpc2: Scalable, robust world models for continuous control, 2024

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control, 2024

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  24. [24]

    Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

    David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

  25. [25]

    Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

    Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, and Justus Piater. Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

  26. [26]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017

  27. [27]

    Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

    Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

  28. [28]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational conference on machine learning, pages 448–456. pmlr, 2015

  29. [29]

    When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

  30. [30]

    Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022

    Gwanghyeon Ji, Juhyeok Mun, Hyeongjun Kim, and Jemin Hwangbo. Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022. ISSN 2377-3774. doi: 10.1109/lra.2022.3151396.http://dx.doi.org/10.1109/LRA.2022.3151396

  31. [31]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923–16930. IEEE, 2025

  32. [32]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  33. [33]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InCoRL, 2025

  34. [34]

    Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

    Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

  35. [35]

    RMA: Rapid Motor Adaptation for Legged Robots

    Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021

  36. [36]

    Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

    Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

  37. [37]

    Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

    Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

  38. [38]

    Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks.arXiv preprint arXiv:2406.02596, 2024

    Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks.arXiv preprint arXiv:2406.02596, 2024

  39. [39]

    Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

    Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

  40. [40]

    Hyperspher- ical normalization for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280,

    Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normaliza- tion for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280, 2025. 14

  41. [41]

    Learning quadrupedal locomotion over challenging terrain,

    Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain.Science Robotics, 5(47), October 2020. ISSN 2470-9476. doi: 10.1126/ scirobotics.abc5986.http://dx.doi.org/10.1126/scirobotics.abc5986

  42. [42]

    Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

    Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

  43. [43]

    BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

    Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

  44. [44]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

  45. [45]

    Softgym: Benchmarking deep reinforcement learning for deformable object manipulation

    Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. InConference on Robot Learning, pages 432–448. PMLR, 2021

  46. [46]

    ngpt: Normalized transformer with rep- resentation learning on the hypersphere

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

  47. [47]

    Understanding plasticity in neural networks.Proc

    Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks.Proc. the International Conference on Machine Learning (ICML), 2023

  48. [48]

    Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

  49. [49]

    Moerland, Joost Broekens, Aske Plaat, and Catholijn M

    Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey.Foundations and Trends in Machine Learning, 16(1):1–118, 2023

  50. [50]

    Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

    A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

  51. [51]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  52. [52]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Gins- burg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017

  53. [53]

    Symmetry considerations for learning task symmetric robot policies

    Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter. Symmetry considerations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024

  54. [54]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

  55. [55]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  56. [56]

    Mock and University of Wyoming

    J.W. Mock and University of Wyoming. Department of Electrical Engineering.A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot. University of Wyoming, 2023. ISBN 9798379561789.https://books.google.co.kr/books?id=waUG0AEACAAJ

  57. [57]

    Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning.arXiv preprint arXiv:2301.10602, 2023

    I Nahrendra, Byeongho Yu, and Hyun Myung. Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning.arXiv preprint arXiv:2301.10602, 2023

  58. [58]

    Reward Centering,

    Abhishek Naik, Yi Wan, Manan Tomar, and Richard S Sutton. Reward centering.arXiv preprint arXiv:2405.09999, 2024

  59. [59]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems, 2024

  60. [60]

    Bigger, regularized, optimistic: scaling for compute and sample-efficient con- tinuous control

    Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control.arXiv preprint arXiv:2405.16158, 2024. 15

  61. [61]

    Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

    Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

  62. [62]

    Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

    Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

  63. [63]

    Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

  64. [64]

    XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

    Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

  65. [65]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  66. [66]

    Asymmetric Actor Critic for Image-Based Robot Learning

    Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning.arXiv preprint arXiv:1710.06542, 2017

  67. [67]

    Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

  68. [69]

    Learning to walk in minutes using massively parallel deep reinforcement learning

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. InConference on robot learning, pages 91–100. PMLR, 2022

  69. [70]

    How does batch normalization help optimization?Advances in neural information processing systems, 31, 2018

    Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?Advances in neural information processing systems, 31, 2018

  70. [71]

    Return-based scaling: Yet another normalisation trick for deep rl.arXiv preprint arXiv:2105.05347, 2021

    Tom Schaul, Georg Ostrovski, Iurii Kemaev, and Diana Borsa. Return-based scaling: Yet another normalisation trick for deep rl.arXiv preprint arXiv:2105.05347, 2021

  71. [72]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  72. [73]

    Schwarke, M

    Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

  73. [74]

    Sferrazza, C., Huang, D.-M., Lin, X., Lee, Y ., and Abbeel, P

    Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025.https://arxiv.org/abs/2512.01996

  74. [75]

    Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

    Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

  75. [76]

    Humanoid- bench: Simulated humanoid benchmark for whole-body locomotion and manipulation.arXiv preprint arXiv:2403.10506,

    Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation.arXiv preprint arXiv:2403.10506, 2024

  76. [77]

    Importance sampling for reinforcement learning with multiple objectives.PhD thesis, Massachusetts Institute of Technology, 2001

    Christian Robert Shelton. Importance sampling for reinforcement learning with multiple objectives.PhD thesis, Massachusetts Institute of Technology, 2001

  77. [78]

    Sim2real manipulation on unknown objects with tactile-based reinforcement learning

    Entong Su, Chengzhe Jia, Yuzhe Qin, Wenxuan Zhou, Annabella Macaluso, Binghao Huang, and Xiaolong Wang. Sim2real manipulation on unknown objects with tactile-based reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9234–9241. IEEE, 2024

  78. [79]

    Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

    Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. InMachine learning proceedings 1990, pages 216–224. Elsevier, 1990

  79. [80]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  80. [81]

    arXiv preprint arXiv:2410.00425 (2024)

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

Showing first 80 references.