pith. machine review for the scientific record.

arxiv: 2604.04539 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.RO

Recognition: 2 theorem links · Lean Theorem

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords reinforcement learning · off-policy learning · robot control · Soft Actor-Critic · high-dimensional control · stability · sim-to-real transfer · scaling laws

The pith

Reducing gradient updates while scaling models and data, plus bounding norms, lets off-policy RL match or beat PPO stability in high-dimensional robot control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that off-policy reinforcement learning has long been limited in high-dimensional robot tasks because fitting value functions over diverse data requires many updates, allowing critic errors to build up through bootstrapping. FlashSAC addresses this by following scaling patterns from supervised learning: it sharply cuts the number of gradient updates, compensates with bigger models and faster data collection, and adds explicit bounds on weight, feature, and gradient norms to keep errors in check. A reader would care because this combination promises the sample-efficiency advantages of off-policy methods without the usual instability, leading to better final performance and much shorter training times across many simulators and real-robot transfer scenarios.

Core claim

FlashSAC modifies Soft Actor-Critic so that it performs far fewer gradient steps, uses larger networks, processes more environment steps per update, and applies hard bounds on weight, feature, and gradient norms; this prevents critic error accumulation while preserving the ability to learn from off-policy data distributions, yielding higher returns and faster convergence than PPO and prior off-policy baselines on over sixty tasks.
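
As a rough illustration of how those three knobs relate, here is a hypothetical configuration sketch; the concrete numbers are placeholders for illustration, not settings reported in the paper.

```python
# Hypothetical sketch of the scaling knobs named in the core claim; the concrete
# values are illustrative assumptions, not the paper's reported configuration.
from dataclasses import dataclass

@dataclass
class RunConfig:
    num_envs: int                 # parallel simulated environments (data throughput)
    hidden_width: int             # actor/critic network width (model capacity)
    updates_per_env_step: float   # gradient updates per collected environment step

standard_sac = RunConfig(num_envs=1, hidden_width=256, updates_per_env_step=1.0)
flashsac_like = RunConfig(num_envs=4096, hidden_width=2048, updates_per_env_step=1 / 32)

# Fewer updates per environment step, but far more environment steps per second and
# a larger model, so each gradient update sees a broader slice of off-policy data.
```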

What carries the argument

The combination of reduced gradient-update frequency with increased model capacity and data throughput, stabilized by explicit bounds on weight, feature, and gradient norms that limit error propagation in the critic.
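
A minimal PyTorch sketch of what such a bounded critic update could look like, assuming layer-normalized hidden layers, a unit-norm feature bound, per-row weight renormalization after each optimizer step, and gradient-norm clipping. The bound values, network width, and exact placement of the bounds are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the stabilization recipe described above (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_GRAD_NORM = 1.0      # gradient-norm bound (assumed value)
MAX_WEIGHT_NORM = 10.0   # per-row weight-norm bound (assumed value)

class BoundedCritic(nn.Module):
    """Large MLP critic whose features are kept on a bounded scale."""
    def __init__(self, obs_dim, act_dim, width=2048):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, width), nn.LayerNorm(width), nn.ReLU(),
            nn.Linear(width, width), nn.LayerNorm(width), nn.ReLU(),
        )
        self.head = nn.Linear(width, 1)

    def forward(self, obs, act):
        feat = self.body(torch.cat([obs, act], dim=-1))
        feat = F.normalize(feat, dim=-1)          # feature-norm bound: unit-norm features
        return self.head(feat)

def bound_weight_norms(model, max_norm=MAX_WEIGHT_NORM):
    """Project each weight matrix back inside a norm ball after the optimizer step."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Linear):
                m.weight.copy_(torch.renorm(m.weight, p=2, dim=0, maxnorm=max_norm))

def critic_update(critic, optimizer, batch, td_target):
    q = critic(batch["obs"], batch["act"])
    loss = F.mse_loss(q, td_target)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(critic.parameters(), MAX_GRAD_NORM)  # gradient-norm bound
    optimizer.step()
    bound_weight_norms(critic)                                    # weight-norm bound
    return loss.item()
```

The layer normalization and feature normalization keep the critic's representations on a fixed scale, while the renorm projection and gradient clipping cap how far any single update can move the function; together these are one plausible reading of "bounds on weight, feature, and gradient norms."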

If this is right

  • FlashSAC reaches higher final returns than PPO and strong off-policy baselines on the majority of tested tasks, with the biggest improvements on high-dimensional control problems such as dexterous manipulation.
  • Training time for sim-to-real humanoid locomotion drops from hours to minutes.
  • The method maintains stability across ten different simulators and more than sixty tasks without requiring task-specific hyper-parameter retuning.
  • Off-policy data reuse becomes practical at scale because the reduced update count is offset by larger models and higher throughput.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling-plus-bounding recipe could be tested on other off-policy algorithms to see whether the stability gains are specific to SAC or more general.
  • If the norm bounds prove robust, they might allow even larger models and lower update frequencies, further accelerating training in domains where data collection is expensive.
  • The approach suggests that RL scaling laws can be made reliable once error accumulation is controlled, opening a route to apply supervised-learning style scaling directly to robot policies.

Load-bearing premise

That bounding weight, feature, and gradient norms is sufficient to stop critic errors from accumulating when the number of gradient updates is sharply reduced.
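
For intuition only (a textbook approximate value iteration bound, not an argument the paper makes): if each critic fit before the next bootstrap introduces at most δ of sup-norm error, the accumulated error stays bounded rather than diverging, which is the behaviour the norm bounds are meant to enforce.

```latex
% Approximate value iteration with bounded per-step error:
%   Q_{k+1} = \mathcal{T}^{*} Q_k + e_k, \qquad \|e_k\|_{\infty} \le \delta,
% where \mathcal{T}^{*} is the \gamma-contractive Bellman optimality operator. Then
\limsup_{k \to \infty} \, \| Q_k - Q^{*} \|_{\infty} \;\le\; \frac{\delta}{1 - \gamma}.
```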

What would settle it

A high-dimensional dexterous manipulation task where FlashSAC either fails to reach higher final performance than PPO or shows clear signs of critic divergence despite the norm bounds and scaling changes.

read the original abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FlashSAC, a variant of Soft Actor-Critic for off-policy RL in high-dimensional robot control. It reduces the frequency of gradient updates per environment step while scaling model capacity and data throughput, and adds explicit bounds on weight, feature, and gradient norms to curb critic error accumulation from bootstrapping. Empirical results across >60 tasks in 10 simulators show consistent outperformance versus PPO and strong off-policy baselines in final performance and sample efficiency, with largest gains on dexterous manipulation; a sim-to-real humanoid example reports training time reduced from hours to minutes.

Significance. If the reported gains can be isolated to the proposed stability mechanism rather than capacity or data-volume differences, the work would be significant for practical deployment of off-policy methods in robotics, where on-policy algorithms like PPO remain dominant due to perceived instability. The scaling-plus-norm-bounds approach offers a concrete, implementable recipe that could generalize beyond the tested simulators.

major comments (1)
  1. [Experiments] Experiments section (and associated tables/figures): the central claim that norm bounds plus reduced updates preserve off-policy advantages requires explicit confirmation that PPO and baseline SAC implementations used identical model sizes, network widths, and environment-step collection rates as FlashSAC. If baselines were run at standard (smaller) scales, performance gaps on high-dimensional dexterous tasks could be explained by capacity and data volume rather than the stability mechanism; this control is load-bearing for attributing gains to the proposed technique.
minor comments (2)
  1. [Abstract] Abstract and §3: the description of 'sharply reduces gradient updates' would benefit from a precise statement of the update-to-environment-step ratio used in FlashSAC versus baselines.
  2. [Ablation studies] The paper should include a dedicated ablation isolating the contribution of each norm bound (weight, feature, gradient) to stability, reported with the same metrics as the main results; a sketch of such an ablation grid follows this list.
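
Both minor points could be addressed with a small, explicit experiment grid. A hedged sketch of what that grid might look like, using hypothetical update-to-step ratios and bound toggles rather than values from the paper:

```python
# Hypothetical ablation grid over the update-to-environment-step ratio and the
# three norm bounds; every value here is a placeholder, not a reported setting.
from itertools import product

update_to_step_ratios = [1 / 8, 1 / 32]     # gradient updates per environment step (assumed)
norm_bound_variants = {
    "all_bounds": dict(weight=True,  feature=True,  grad=True),
    "no_weight":  dict(weight=False, feature=True,  grad=True),
    "no_feature": dict(weight=True,  feature=False, grad=True),
    "no_grad":    dict(weight=True,  feature=True,  grad=False),
}

runs = [
    {"utd_ratio": utd, "variant": name, **bounds}
    for utd, (name, bounds) in product(update_to_step_ratios, norm_bound_variants.items())
]
for run in runs:
    print(run)  # each configuration would be trained with otherwise identical hyper-parameters
```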

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying a key point needed to strengthen attribution of our results. We address the major comment below and commit to revisions that provide the requested controls and clarifications.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that norm bounds plus reduced updates preserve off-policy advantages requires explicit confirmation that PPO and baseline SAC implementations used identical model sizes, network widths, and environment-step collection rates as FlashSAC. If baselines were run at standard (smaller) scales, performance gaps on high-dimensional dexterous tasks could be explained by capacity and data volume rather than the stability mechanism; this control is load-bearing for attributing gains to the proposed technique.

    Authors: We agree that explicit side-by-side confirmation of scales is necessary for rigorous attribution. FlashSAC is intentionally designed around scaling (larger models, higher data throughput, fewer updates per step) plus norm bounds to stabilize off-policy learning at that scale; standard PPO and SAC baselines in our experiments follow their canonical implementations (e.g., from Stable Baselines3 and original papers), which use smaller widths (typically 256-512 hidden units, 2-3 layers) and lower throughput. The current manuscript and appendix already list per-method hyperparameters, but we will revise the Experiments section to add a consolidated table explicitly comparing model sizes, widths, layers, and environment steps per update across all methods. To isolate the stability mechanisms from raw capacity, we will also add an ablation comparing capacity-matched SAC (identical model size and throughput to FlashSAC) with and without norm bounds. These additions will be included in the revised manuscript and will directly address the load-bearing concern about whether gains stem from the proposed technique rather than scale alone. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical algorithm presentation

full rationale

The paper introduces FlashSAC as a practical modification to Soft Actor-Critic, motivated by external scaling laws from supervised learning and stabilized via explicit norm bounds on weights, features, and gradients. All performance claims rest on direct empirical comparisons across 60+ tasks rather than any closed-form derivation, fitted parameter renamed as prediction, or self-referential equation. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or method description. The central contribution is an engineering recipe validated by external benchmarks, making the result self-contained and independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract only: the approach assumes that scaling laws observed in supervised learning transfer to RL critic training, and that norm bounding suffices to control bootstrapping error accumulation, without further justification or external validation.

axioms (1)
  • domain assumption Scaling laws from supervised learning apply directly to off-policy RL value function fitting when gradient updates are reduced.
    Motivation stated in abstract for reducing updates while increasing model size and data throughput.

pith-pipeline@v0.9.0 · 5574 in / 1296 out tokens · 51175 ms · 2026-05-10T19:59:18.599888+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 50 canonical work pages · 12 internal anchors

  1. [1]

    Loss of plasticity in continual deep reinforcement learning

    Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, and Marlos C Machado. Loss of plasticity in continual deep reinforcement learning. InConference on lifelong learning agents, pages 620–636. PMLR, 2023

  2. [2]

    Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

  3. [3]

    A brief survey of deep reinforcement learning,

    Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning.arXiv preprint arXiv:1708.05866, 2017

  4. [4]

    Genesis: A generative and universal physics engine for robotics and beyond, December 2024

    Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. https://github.com/Genesis-Embodied-AI/Genesis

  5. [5]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  6. [6]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  7. [7]

    A distributional perspective on reinforcement learning

    Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. PMLR, 2017

  8. [8]

    CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity

    Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. International Conference on Learning Representations (ICLR), 2024

  9. [9]

    Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

    Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

  10. [10]

    Towards human-level bimanual dexterous manipulation with reinforcement learning

    Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with reinforcement learning. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.https://openreview.net/...

  11. [11]

Temporally-extended ε-greedy exploration

Will Dabney, Georg Ostrovski, and André Barreto. Temporally-extended ε-greedy exploration. arXiv preprint arXiv:2006.01782, 2020

  12. [12]

Maintaining plasticity in deep continual learning

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Parash Rahman, Richard S Sutton, and A Rupam Mahmood. Maintaining plasticity in deep continual learning.arXiv preprint arXiv:2306.13812, 2023

  13. [13]

    Pink noise is all you need: Colored noise exploration in deep reinforcement learning

    Onno Eberhard, Jakob Hollenstein, Cristina Pinneri, and Georg Martius. Pink noise is all you need: Colored noise exploration in deep reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

  14. [14]

    Revisiting fundamentals of experience replay

    William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InInternational conference on machine learning, pages 3061–3071. PMLR, 2020

  15. [15]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

  16. [16]

For SALE: State-action representation learning for deep reinforcement learning

    Scott Fujimoto, Wei-Di Chang, Edward J Smith, Shixiang Shane Gu, Doina Precup, and David Meger. For sale: State-action representation learning for deep reinforcement learning.arXiv preprint arXiv:2306.02451, 2023

  17. [17]

    Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

    Scott Fujimoto, Pierluca D’Oro, Amy Zhang, Yuandong Tian, and Michael Rabbat. Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

  18. [18]

    Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811, 2024

    Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811, 2024

  19. [19]

    Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning,

    Zhaoyuan Gu, Junheng Li, Wenlan Shen, Wenhao Yu, Zhaoming Xie, Stephen McCrory, Xianyi Cheng, Abdulaziz Shamsah, Robert Griffin, C Karen Liu, et al. Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning.arXiv preprint arXiv:2501.02116, 2025

  20. [20]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  21. [21]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  22. [22]

    Td-mpc2: Scalable, robust world models for continuous control, 2024

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control, 2024

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  24. [24]

    Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

    David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

  25. [25]

    Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

    Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, and Justus Piater. Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

  26. [26]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017

  27. [27]

    Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

    Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

  28. [28]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational conference on machine learning, pages 448–456. pmlr, 2015

  29. [29]

    When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

  30. [30]

    Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022

    Gwanghyeon Ji, Juhyeok Mun, Hyeongjun Kim, and Jemin Hwangbo. Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022. ISSN 2377-3774. doi: 10.1109/lra.2022.3151396.http://dx.doi.org/10.1109/LRA.2022.3151396

  31. [31]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923–16930. IEEE, 2025

  32. [32]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  33. [33]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InCoRL, 2025

  34. [34]

    Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

    Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

  35. [35]

    Rma: Rapid motor adaptation for legged robots,

    Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021

  36. [36]

    Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

    Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

  37. [37]

    Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

    Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

  38. [38]

    Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks,

    Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks.arXiv preprint arXiv:2406.02596, 2024

  39. [39]

    Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

    Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

  40. [40]

    Hyperspherical normalization for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280, 2025

Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normalization for scalable deep reinforcement learning. arXiv preprint arXiv:2502.15280, 2025

  41. [41]

    Learning quadrupedal locomotion over challenging terrain,

    Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain.Science Robotics, 5(47), October 2020. ISSN 2470-9476. doi: 10.1126/ scirobotics.abc5986.http://dx.doi.org/10.1126/scirobotics.abc5986

  42. [42]

    Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

    Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

  43. [43]

Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion

    Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

  44. [44]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

  45. [45]

    Softgym: Benchmarking deep reinforcement learning for deformable object manipulation

    Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. InConference on Robot Learning, pages 432–448. PMLR, 2021

  46. [46]

nGPT: Normalized transformer with representation learning on the hypersphere

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

  47. [47]

Understanding plasticity in neural networks

    Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks.Proc. the International Conference on Machine Learning (ICML), 2023

  48. [48]

    Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

  49. [49]

Model-based reinforcement learning: A survey

    Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey.Foundations and Trends in Machine Learning, 16(1):1–118, 2023

  50. [50]

    Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

    A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

  51. [51]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  52. [52]

    Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017

  53. [53]

    Symmetry considerations for learning task symmetric robot policies

    Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter. Symmetry considerations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024

  54. [54]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

  55. [55]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  56. [56]

A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot

    J.W. Mock and University of Wyoming. Department of Electrical Engineering.A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot. University of Wyoming, 2023. ISBN 9798379561789.https://books.google.co.kr/books?id=waUG0AEACAAJ

  57. [57]

    Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning,

    I Nahrendra, Byeongho Yu, and Hyun Myung. Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning.arXiv preprint arXiv:2301.10602, 2023

  58. [58]

    Reward centering.arXiv preprint arXiv:2405.09999, 2024

    Abhishek Naik, Yi Wan, Manan Tomar, and Richard S Sutton. Reward centering.arXiv preprint arXiv:2405.09999, 2024

  59. [59]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems, 2024

  60. [60]

    Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control.arXiv preprint arXiv:2405.16158, 2024

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control. arXiv preprint arXiv:2405.16158, 2024

  61. [61]

Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners

    Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

  62. [62]

    Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

    Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

  63. [63]

    Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

  64. [64]

    XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

    Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

  65. [65]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  66. [66]

    Asymmetric Actor Critic for Image-Based Robot Learning

    Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning.arXiv preprint arXiv:1710.06542, 2017

  67. [67]

Habitat 3.0: A co-habitat for humans, avatars and robots

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

  68. [69]

    Learning to walk in minutes using massively parallel deep reinforcement learning

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. InConference on robot learning, pages 91–100. PMLR, 2022

  69. [70]

    How does batch normalization help optimization?Advances in neural information processing systems, 31, 2018

    Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?Advances in neural information processing systems, 31, 2018

  70. [71]

Return-based scaling: Yet another normalisation trick for deep RL

    Tom Schaul, Georg Ostrovski, Iurii Kemaev, and Diana Borsa. Return-based scaling: Yet another normalisation trick for deep rl.arXiv preprint arXiv:2105.05347, 2021

  71. [72]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  72. [73]

RSL-RL: A learning library for robotics research

    Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

  73. [74]

    Learning sim-to-real humanoid locomotion in 15 minutes.arXiv preprint arXiv:2512.01996, 2025

    Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025.https://arxiv.org/abs/2512.01996

  74. [75]

    Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

    Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

  75. [76]

HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation

    Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation.arXiv preprint arXiv:2403.10506, 2024

  76. [77]

    Importance sampling for reinforcement learning with multiple objectives.PhD thesis, Massachusetts Institute of Technology, 2001

    Christian Robert Shelton. Importance sampling for reinforcement learning with multiple objectives.PhD thesis, Massachusetts Institute of Technology, 2001

  77. [78]

    Sim2real manipulation on unknown objects with tactile-based reinforcement learning

    Entong Su, Chengzhe Jia, Yuzhe Qin, Wenxuan Zhou, Annabella Macaluso, Binghao Huang, and Xiaolong Wang. Sim2real manipulation on unknown objects with tactile-based reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9234–9241. IEEE, 2024

  78. [79]

    Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

    Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. InMachine learning proceedings 1990, pages 216–224. Elsevier, 1990

  79. [80]

Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  80. [81]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

Showing first 80 references.