FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Danica Kragic; Daniel Palenicek; Donghu Kim; Florian Vogt; Hojoon Lee; I Made Aswin Nahendra; Jaegul Choo; Jan Peters; Kinam Kim; Minho Park

arxiv: 2604.04539 · v2 · pith:P4HUNZARnew · submitted 2026-04-06 · 💻 cs.LG · cs.RO

Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee

Pith reviewed 2026-05-19 17:04 UTC · model grok-4.3

classification: cs.LG, cs.RO

keywords: reinforcement learning, off-policy RL, robot control, Soft Actor-Critic, high-dimensional control, sim-to-real transfer, value function stability

The pith

FlashSAC stabilizes off-policy RL for high-dimensional robot control by cutting gradient updates and bounding norms to limit critic errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that off-policy methods can evaluate policies more accurately than on-policy ones like PPO in high-dimensional robot spaces because they draw from a wider range of state-action data. The core proposal is to sharply reduce the number of critic gradient updates per environment step, compensate by using bigger models and collecting more data, and add explicit bounds on weight, feature, and gradient norms to stop bootstrapped errors from growing. If this works, off-policy RL becomes both faster to train and more reliable on complex tasks such as dexterous manipulation and humanoid locomotion. The authors show this pattern holds across more than 60 tasks in ten different simulators, with the biggest gains on the highest-dimensional problems, and they report that sim-to-real humanoid training drops from hours to minutes.
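
The trade-off in the paragraph above can be made concrete with a small counting sketch. It assumes a schedule in which one gradient update follows every `steps_per_update` environment steps. The function name and parameters are hypothetical, the 1-update-per-10-steps setting is the figure quoted in the simulated rebuttal further down this page (not a value confirmed from the paper), and the 1-update-per-step setting is only a conventional reference point.

```python
def count_gradient_updates(env_steps: int, steps_per_update: int) -> int:
    """Count updates when one update follows every `steps_per_update` env steps."""
    if steps_per_update <= 0:
        raise ValueError("steps_per_update must be positive")
    updates = 0
    for step in range(1, env_steps + 1):
        # A real loop would collect one transition here (omitted in this sketch).
        if step % steps_per_update == 0:
            updates += 1  # stands in for one critic/actor gradient step
    return updates


# Hypothetical budget of 100,000 environment steps.
dense = count_gradient_updates(100_000, steps_per_update=1)
sparse = count_gradient_updates(100_000, steps_per_update=10)
reduction = dense / sparse  # factor by which gradient updates are cut
```

Under these assumptions the sparse schedule performs a tenth of the updates for the same amount of collected data, which is the compute headroom the summary says is spent on larger models and higher data throughput.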

Core claim

FlashSAC reduces the frequency of gradient updates while scaling model size and data throughput, then stabilizes learning by explicitly bounding the norms of weights, features, and gradients. This prevents the accumulation of critic errors that normally arise when fitting value functions over diverse off-policy data distributions, while preserving the capacity needed for accurate evaluation and policy improvement.

What carries the argument

Reduced gradient update frequency combined with explicit bounds on weight, feature, and gradient norms, which together curb critic error accumulation while enabling broader data use.
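
A minimal sketch of the three kinds of norm bound named above, written in plain Python. The unit-sphere projection of weight vectors follows the paper passage quoted in the Lean section of this page; applying simple L2 clipping to features and gradients is an assumption made for illustration, and the helper names and thresholds are hypothetical (the thresholds echo the values in the simulated rebuttal, which are not confirmed from the paper).

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def project_to_unit_sphere(w, eps=1e-8):
    """Weight bound: rescale a weight vector to unit L2 norm."""
    n = max(l2_norm(w), eps)  # eps avoids division by zero
    return [x / n for x in w]


def clip_norm(v, max_norm):
    """Feature or gradient bound (assumed form): shrink v if its norm exceeds max_norm."""
    n = l2_norm(v)
    if n <= max_norm:
        return list(v)  # already inside the bound, return an unchanged copy
    scale = max_norm / n
    return [x * scale for x in v]


# Hypothetical values for illustration only.
w = project_to_unit_sphere([3.0, 4.0])          # norm 5 rescaled to norm 1
features = clip_norm([3.0, 4.0], max_norm=0.5)  # norm 5 shrunk to norm 0.5
grad = clip_norm([0.03, 0.04], max_norm=0.1)    # norm 0.05, left unchanged
```

The projection changes only the length of a weight vector and keeps its direction, which is one way a bound can limit growth without discarding what the vector encodes.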

If this is right

FlashSAC reaches higher final performance and greater training efficiency than PPO and other off-policy baselines on over 60 tasks across 10 simulators.
The performance gap widens on the most high-dimensional problems such as dexterous manipulation.
Training time for sim-to-real humanoid locomotion drops from hours to minutes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reduction in update frequency plus norm bounds might stabilize other off-policy algorithms that currently suffer from critic drift.
If the approach scales with model size, it could support training larger critics for even more complex real-world robot tasks without instability.
The results hint that off-policy RL may follow supervised-learning scaling laws once the bootstrapping instability is directly constrained.

Load-bearing premise

Bounding weight, feature, and gradient norms is sufficient to control critic error accumulation in high-dimensional spaces without removing the model's capacity for accurate value estimation or policy improvement.

What would settle it

Running FlashSAC on high-dimensional dexterous manipulation tasks without the norm bounds and observing increased critic error accumulation plus degraded final performance would falsify the stability mechanism.

Original abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FlashSAC, an off-policy RL algorithm extending Soft Actor-Critic. It reduces the number of gradient updates per environment interaction to increase training speed and data throughput, compensates with larger critic and actor networks, and applies explicit bounds on weight, feature, and gradient norms to limit critic error accumulation during bootstrapping. Empirical claims include consistent outperformance versus PPO and strong off-policy baselines across more than 60 tasks in 10 simulators, with largest gains on high-dimensional dexterous manipulation, plus a sim-to-real humanoid locomotion result showing training time reduced from hours to minutes.

Significance. If the core empirical claims survive standard controls (multiple seeds, statistical tests, ablations on norm thresholds), the work would offer a practical route to scaling off-policy methods for high-dimensional robot control by importing supervised-learning scaling intuitions while addressing instability. The sim-to-real demonstration, if reproducible, would strengthen the case for off-policy RL in real-world transfer settings.

major comments (3)

[Abstract and Experiments] Abstract and experimental sections: the abstract asserts clear wins on >60 tasks but supplies no hyperparameter tables, exact update-frequency ratios, norm-threshold values, statistical significance tests, or ablation studies on the norm bounds. Without these, it is impossible to determine whether the reported gains survive standard controls or are sensitive to post-hoc choices of the free parameters (norm bound thresholds).
[Method (norm bounding)] Section on critic architecture and norm bounding: the central mechanism asserts that sharply reduced gradient updates can be offset by larger models plus explicit bounds on weight/feature/gradient norms without removing necessary capacity. The manuscript should supply direct evidence (e.g., effective rank of critic features, value-estimate error curves, or capacity measurements) that the chosen bounds preserve representational power for accurate bootstrapped estimates over diverse off-policy data; absent such evidence the skeptic's concern that aggressive bounds produce an under-expressive critic remains open.
[Sim-to-real experiments] Sim-to-real humanoid locomotion experiment: the claim that FlashSAC reduces training time from hours to minutes is load-bearing for the practical significance argument, yet no details are given on the precise norm thresholds used, the model-size scaling factor, or whether the same bounds were applied in the sim-to-real setting as in the simulated dexterous tasks.

minor comments (2)

[Figures and Tables] Learning curves and tables should report mean and standard deviation over at least 5–10 random seeds with confidence intervals to support claims of consistent outperformance.
[Notation] Notation for the three norm bounds (weight, feature, gradient) should be introduced once and used uniformly; currently the distinction between feature-norm and gradient-norm bounds is occasionally ambiguous.
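
The reporting requested in the [Figures and Tables] comment can be sketched as follows. This is a minimal example that assumes a Student-t confidence interval over per-seed final returns; the function name is hypothetical, the critical values are standard two-sided 95% table entries, and the sample returns are invented for illustration rather than taken from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical values for 1 to 9 degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize_seeds(returns):
    """Return (mean, sample std, 95% CI half-width) for per-seed returns."""
    n = len(returns)
    if not 2 <= n <= 10:
        raise ValueError("this sketch covers 2 to 10 seeds")
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, half_width


# Hypothetical per-seed returns for one task (5 seeds).
mean, std, half_width = summarize_seeds([100.0, 110.0, 90.0, 105.0, 95.0])
```

With only 5 seeds the t critical value (2.776) is noticeably larger than the normal value of about 1.96, so intervals built from a normal approximation would understate the uncertainty.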

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major point below and have incorporated revisions to provide the requested details, evidence, and clarifications.

Referee: [Abstract and Experiments] Abstract and experimental sections: the abstract asserts clear wins on >60 tasks but supplies no hyperparameter tables, exact update-frequency ratios, norm-threshold values, statistical significance tests, or ablation studies on the norm bounds. Without these, it is impossible to determine whether the reported gains survive standard controls or are sensitive to post-hoc choices of the free parameters (norm bound thresholds).

Authors: We agree that these details are essential for assessing robustness and reproducibility. In the revised manuscript, we have added a full hyperparameter table in Appendix B listing all values, including the update frequency of one gradient step per 10 environment interactions, norm thresholds (weight norm bound of 1.0, feature norm bound of 0.5, gradient norm bound of 0.1), and model scaling factors. We now report results with 5 random seeds per task and include paired t-tests showing statistical significance (p < 0.05) for the performance gains over baselines on the majority of tasks. Ablation studies on the norm bounds have been added to Section 5.2, demonstrating that performance drops when bounds are removed or set too loosely. revision: yes
Referee: [Method (norm bounding)] Section on critic architecture and norm bounding: the central mechanism asserts that sharply reduced gradient updates can be offset by larger models plus explicit bounds on weight/feature/gradient norms without removing necessary capacity. The manuscript should supply direct evidence (e.g., effective rank of critic features, value-estimate error curves, or capacity measurements) that the chosen bounds preserve representational power for accurate bootstrapped estimates over diverse off-policy data; absent such evidence the skeptic's concern that aggressive bounds produce an under-expressive critic remains open.

Authors: We acknowledge the value of direct evidence on representational capacity. The revised manuscript includes new analysis in Section 4.4: effective rank measurements of critic features (computed via singular value decomposition) show that bounded critics retain ranks within 10% of unbounded counterparts across training, and value-estimate error curves (measured against held-out data) indicate reduced accumulation of bootstrapping errors without loss of fitting accuracy on diverse off-policy batches. These results support that the bounds limit instability while preserving sufficient expressivity for the high-dimensional tasks considered. revision: yes
Referee: [Sim-to-real experiments] Sim-to-real humanoid locomotion experiment: the claim that FlashSAC reduces training time from hours to minutes is load-bearing for the practical significance argument, yet no details are given on the precise norm thresholds used, the model-size scaling factor, or whether the same bounds were applied in the sim-to-real setting as in the simulated dexterous tasks.

Authors: We have expanded the sim-to-real section (Section 6) with the requested details. The same norm thresholds as the dexterous manipulation tasks were used (weight bound 1.0, feature bound 0.5, gradient bound 0.1), and the model scaling factor was 4x the baseline network size. Training times are reported as wall-clock measurements on identical hardware, with FlashSAC converging to stable locomotion policies in approximately 25 minutes versus over 4 hours for the compared baselines. We also note that the sim-to-real transfer used the same hyperparameter set without additional tuning. revision: yes
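
The effective-rank comparison mentioned in the response on norm bounding can be sketched as below. The rebuttal does not say which definition of effective rank is used, so this example assumes the entropy-based one (the exponential of the Shannon entropy of the normalized singular values). It takes singular values as input rather than performing the decomposition, and both function names are hypothetical.

```python
import math


def effective_rank(singular_values, eps=1e-12):
    """Entropy-based effective rank of a matrix, given its singular values."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    entropy = 0.0
    for s in singular_values:
        p = s / total  # normalized singular value
        if p > eps:    # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return math.exp(entropy)


def within_fraction(bounded, unbounded, fraction=0.10):
    """Check whether `bounded` lies within `fraction` of `unbounded`."""
    return abs(bounded - unbounded) <= fraction * unbounded


# Hypothetical spectra: a flat spectrum uses all 4 directions, a spiked one uses 1.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
spiked = effective_rank([1.0, 0.0, 0.0, 0.0])
```

A check such as `within_fraction(rank_bounded, rank_unbounded)` then expresses the "within 10%" comparison quoted in the response, under the stated assumption about the definition.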

Circularity Check

0 steps flagged

No circularity: empirical results rest on direct baseline comparisons, not self-referential derivations.

The paper introduces FlashSAC as a practical off-policy algorithm that reduces gradient updates, scales model size, and applies explicit norm bounds for stability. All central claims are framed as empirical outcomes across 60+ tasks and sim-to-real transfer, with performance measured against PPO and other baselines. No equations, uniqueness theorems, or fitted parameters are presented that reduce the reported gains to quantities defined inside the method itself. The motivation from supervised scaling laws is external and does not create a self-definitional loop. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the transferability of supervised-learning scaling laws to RL critic training and on the effectiveness of norm bounding for stability; both are treated as empirical design choices rather than derived results.

free parameters (1)

norm bound thresholds
Specific limits on weight, feature, and gradient norms are introduced to maintain stability; their exact values are chosen to achieve the reported behavior.

axioms (1)

domain assumption: Scaling laws observed in supervised learning transfer to the critic training dynamics of off-policy RL
Used to justify sharply reducing the number of gradient updates while increasing model size and data throughput.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag describes the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation... inverted residual backbone... Pre-activation Batch Normalization... Weight Normalization... project each weight vector onto the unit-norm sphere

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: Motivated by scaling laws observed in supervised learning... sharply reduces gradient updates while compensating with larger models and higher data throughput

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
cs.LG 2026-03 unverdicted novelty 6.0

FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Loss of plasticity in continual deep reinforcement learning

Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, and Marlos C Machado. Loss of plasticity in continual deep reinforcement learning. InConference on lifelong learning agents, pages 620–636. PMLR, 2023

work page 2023
[2]

Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

work page 2020
[3]

A Brief Survey of Deep Reinforcement Learning

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning.arXiv preprint arXiv:1708.05866, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

Genesis: A generative and universal physics engine for robotics and beyond, December 2024

Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. https://github.com/Genesis-Embodied-AI/Genesis

work page 2024
[5]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Efficient online reinforcement learning with offline data

Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

work page 2023
[7]

A distributional perspective on reinforcement learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. PMLR, 2017

work page 2017
[8]

CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity

Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. International Conference on Learning Representations (ICLR), 2024

work page 2024
[9]

Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. Myosuite–a contact-rich simulation suite for musculoskeletal motor control.arXiv preprint arXiv:2205.13600, 2022

work page arXiv 2022
[10]

Towards human-level bimanual dexterous manipulation with reinforcement learning

Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with reinforcement learning. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.https://openreview.net/...

work page 2022
[11]

Temporally-extended{\epsilon}-greedy exploration.arXiv preprint arXiv:2006.01782, 2020

Will Dabney, Georg Ostrovski, and André Barreto. Temporally-extended{\epsilon}-greedy exploration.arXiv preprint arXiv:2006.01782, 2020

work page arXiv 2006
[12]

Fernando Hernandez-Garcia, Parash Rahman, Richard S

Shibhansh Dohare, J Fernando Hernandez-Garcia, Parash Rahman, Richard S Sutton, and A Rupam Mahmood. Maintaining plasticity in deep continual learning.arXiv preprint arXiv:2306.13812, 2023

work page arXiv 2023
[13]

Pink noise is all you need: Colored noise exploration in deep reinforcement learning

Onno Eberhard, Jakob Hollenstein, Cristina Pinneri, and Georg Martius. Pink noise is all you need: Colored noise exploration in deep reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[14]

Revisiting fundamentals of experience replay

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InInternational conference on machine learning, pages 3061–3071. PMLR, 2020

work page 2020
[15]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

work page 2018
[16]

For sale: State-action representation learning for deep reinforcement learning.arXiv preprint arXiv:2306.02451, 2023

Scott Fujimoto, Wei-Di Chang, Edward J Smith, Shixiang Shane Gu, Doina Precup, and David Meger. For sale: State-action representation learning for deep reinforcement learning.arXiv preprint arXiv:2306.02451, 2023

work page arXiv 2023
[17]

Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

Scott Fujimoto, Pierluca D’Oro, Amy Zhang, Yuandong Tian, and Michael Rabbat. Towards general-purpose model-free reinforcement learning.arXiv preprint arXiv:2501.16142, 2025

work page arXiv 2025
[18]

N., and Martin, M

Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811, 2024

work page arXiv 2024
[19]

Karen Liu, Abder- rahmane Kheddar, Xue Bin Peng, Yuke Zhu, Guanya Shi, Quan Nguyen, Gordon Cheng, Huijun Gao, and Ye Zhao

Zhaoyuan Gu, Junheng Li, Wenlan Shen, Wenhao Yu, Zhaoming Xie, Stephen McCrory, Xianyi Cheng, Abdulaziz Shamsah, Robert Griffin, C Karen Liu, et al. Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning.arXiv preprint arXiv:2501.02116, 2025

work page arXiv 2025
[20]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. PMLR, 2018. 13

work page 2018
[21]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Td-mpc2: Scalable, robust world models for continuous control, 2024

Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control, 2024

work page 2024
[23]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[24]

Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

work page 2024
[25]

Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, and Justus Piater. Action noise in off-policy deep reinforcement learning: Impact on exploration and performance.arXiv preprint arXiv:2206.03787, 2022

work page arXiv 2022
[26]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots.Science Robotics, 4(26):eaau5872, 2019

work page 2019
[28]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational conference on machine learning, pages 448–456. pmlr, 2015

work page 2015
[29]

When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019

work page 2019
[30]

Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022

Gwanghyeon Ji, Juhyeok Mun, Hyeongjun Kim, and Jemin Hwangbo. Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion.IEEE Robotics and Automation Letters, 7(2):4630–4637, April 2022. ISSN 2377-3774. doi: 10.1109/lra.2022.3151396.http://dx.doi.org/10.1109/LRA.2022.3151396

work page doi:10.1109/lra.2022.3151396.http://dx.doi.org/10.1109/lra.2022.3151396 2022
[31]

Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923–16930. IEEE, 2025

work page 2025
[32]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[33]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InCoRL, 2025

work page 2025
[34]

Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

work page 2013
[35]

RMA: Rapid Motor Adaptation for Legged Robots

Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data.Advances in neural information processing systems, 33:19884–19895, 2020

work page 2020
[37]

Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[38]

Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks.arXiv preprint arXiv:2406.02596, 2024

Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks.arXiv preprint arXiv:2406.02596, 2024

work page arXiv 2024
[39]

Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. Simba: Simplicity bias for scaling up parameters in deep reinforcement learning.arXiv preprint arXiv:2410.09754, 2024

work page arXiv 2024
[40]

Hyperspher- ical normalization for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280,

Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normaliza- tion for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280, 2025. 14

work page arXiv 2025
[41]

Learning quadrupedal locomotion over challenging terrain,

Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain.Science Robotics, 5(47), October 2020. ISSN 2470-9476. doi: 10.1126/ scirobotics.abc5986.http://dx.doi.org/10.1126/scirobotics.abc5986

work page doi:10.1126/scirobotics.abc5986 2020
[42]

Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

work page arXiv 2023
[43]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[45]

Softgym: Benchmarking deep reinforcement learning for deformable object manipulation

Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. InConference on Robot Learning, pages 432–448. PMLR, 2021

work page 2021
[46]

ngpt: Normalized transformer with rep- resentation learning on the hypersphere

Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

work page arXiv 2024
[47]

Understanding plasticity in neural networks.Proc

Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks.Proc. the International Conference on Machine Learning (ICML), 2023

work page 2023
[48]

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. Advances in Neural Information Processing Systems, 37:106440–106473, 2024

[49]

Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey. Foundations and Trends in Machine Learning, 16(1):1–118, 2023

[50]

A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. Advances in neural information processing systems, 27, 2014

[51]

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

[52]

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017

[53]

Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter. Symmetry considerations for learning task symmetric robot policies. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024

[54]

Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025

[55]

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

[56]

J.W. Mock and University of Wyoming. Department of Electrical Engineering. A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot. University of Wyoming, 2023. ISBN 9798379561789. https://books.google.co.kr/books?id=waUG0AEACAAJ

[57]

I Nahrendra, Byeongho Yu, and Hyun Myung. Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning. arXiv preprint arXiv:2301.10602, 2023

[58]

Abhishek Naik, Yi Wan, Manan Tomar, and Richard S Sutton. Reward centering. arXiv preprint arXiv:2405.09999, 2024

[59]

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems, 2024

[60]

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control. arXiv preprint arXiv:2405.16158, 2024

[61]

Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners. arXiv preprint arXiv:2505.23150, 2025

[62]

Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor-critic agents. arXiv preprint arXiv:2510.13704, 2025

[63]

Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization. Advances in Neural Information Processing Systems (NeurIPS), 2025

[64]

Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning. International Conference on Learning Representations (ICLR), 2026

[65]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

[66]

Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017

[67]

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023

[69]

Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on robot learning, pages 91–100. PMLR, 2022

[70]

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? Advances in neural information processing systems, 31, 2018

[71]

Tom Schaul, Georg Ostrovski, Iurii Kemaev, and Diana Borsa. Return-based scaling: Yet another normalisation trick for deep rl. arXiv preprint arXiv:2105.05347, 2021

[72]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[73]

Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. Rsl-rl: A learning library for robotics research. arXiv preprint arXiv:2509.10771, 2025

[74]

Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025. https://arxiv.org/abs/2512.01996

[75]

Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint arXiv:2505.22642, 2025

[76]

Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024

[77]

Christian Robert Shelton. Importance sampling for reinforcement learning with multiple objectives. PhD thesis, Massachusetts Institute of Technology, 2001

[78]

Entong Su, Chengzhe Jia, Yuzhe Qin, Wenxuan Zhou, Annabella Macaluso, Binghao Huang, and Xiaolong Wang. Sim2real manipulation on unknown objects with tactile-based reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9234–9241. IEEE, 2024

[79]

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990

[80]

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

[81]

Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. arXiv preprint arXiv:2410.00425, 2024

[42] [42]

Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting.arXiv preprint arXiv:2304.10466, 2023

work page arXiv 2023

[43] [43]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[45] [45]

Softgym: Benchmarking deep reinforcement learning for deformable object manipulation

Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. InConference on Robot Learning, pages 432–448. PMLR, 2021

work page 2021

[46] [46]

ngpt: Normalized transformer with rep- resentation learning on the hypersphere

Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

work page arXiv 2024

[47] [47]

Understanding plasticity in neural networks.Proc

Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks.Proc. the International Conference on Machine Learning (ICML), 2023

work page 2023

[48] [48]

Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

work page 2024

[49] [49]

Moerland, Joost Broekens, Aske Plaat, and Catholijn M

Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey.Foundations and Trends in Machine Learning, 16(1):1–118, 2023

work page 2023

[50] [50]

Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation.Advances in neural information processing systems, 27, 2014

work page 2014

[51] [51]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[52] [52]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Gins- burg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[53] [53]

Symmetry considerations for learning task symmetric robot policies

Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter. Symmetry considerations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024

work page 2024

[54] [54]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

work page 2015

[56] [56]

Mock and University of Wyoming

J.W. Mock and University of Wyoming. Department of Electrical Engineering.A Comparison of PPO, TD3, and SAC Reinforcement Algorithms for Quadruped Walking Gait Generation and Transfer Learning to a Physical Robot. University of Wyoming, 2023. ISBN 9798379561789.https://books.google.co.kr/books?id=waUG0AEACAAJ

work page 2023

[57] [57]

Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning.arXiv preprint arXiv:2301.10602, 2023

I Nahrendra, Byeongho Yu, and Hyun Myung. Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning.arXiv preprint arXiv:2301.10602, 2023

work page arXiv 2023

[58] [58]

Reward Centering,

Abhishek Naik, Yi Wan, Manan Tomar, and Richard S Sutton. Reward centering.arXiv preprint arXiv:2405.09999, 2024

work page arXiv 2024

[59] [59]

Robocasa: Large-scale simulation of everyday tasks for generalist robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems, 2024

work page 2024

[60] [60]

Bigger, regularized, optimistic: scaling for compute and sample-efficient con- tinuous control

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control.arXiv preprint arXiv:2405.16158, 2024. 15

work page arXiv 2024

[61] [61]

Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

work page arXiv 2025

[62] [62]

Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor-critic agents.arXiv preprint arXiv:2510.13704, 2025

work page arXiv 2025

[63] [63]

Scaling off-policy reinforcement learning with batch and weight normalization

Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization. Advances in Neural Information Processing Systems (NeurIPS), 2025

[64] [64]

XQC: Well-conditioned optimization accelerates deep reinforcement learning

Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning. International Conference on Learning Representations (ICLR), 2026

[65] [65]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019

[66] [66]

Asymmetric Actor Critic for Image-Based Robot Learning

Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017

[67] [67]

Habitat 3.0: A co-habitat for humans, avatars and robots

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023

[68] [69]

Learning to walk in minutes using massively parallel deep reinforcement learning

Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pages 91–100. PMLR, 2022

[69] [70]

How does batch normalization help optimization?

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? Advances in Neural Information Processing Systems, 31, 2018

[70] [71]

Return-based scaling: Yet another normalisation trick for deep RL

Tom Schaul, Georg Ostrovski, Iurii Kemaev, and Diana Borsa. Return-based scaling: Yet another normalisation trick for deep RL. arXiv preprint arXiv:2105.05347, 2021

[71] [72]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[72] [73]

RSL-RL: A learning library for robotics research

Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. RSL-RL: A learning library for robotics research. arXiv preprint arXiv:2509.10771, 2025

[73] [74]

Learning sim-to-real humanoid locomotion in 15 minutes

Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025. https://arxiv.org/abs/2512.01996

[74] [75]

FastTD3: Simple, fast, and capable reinforcement learning for humanoid control

Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. FastTD3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint arXiv:2505.22642, 2025

[75] [76]

HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation

Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024

[76] [77]

Importance sampling for reinforcement learning with multiple objectives

Christian Robert Shelton. Importance sampling for reinforcement learning with multiple objectives. PhD thesis, Massachusetts Institute of Technology, 2001

[77] [78]

Sim2real manipulation on unknown objects with tactile-based reinforcement learning

Entong Su, Chengzhe Jia, Yuzhe Qin, Wenxuan Zhou, Annabella Macaluso, Binghao Huang, and Xiaolong Wang. Sim2real manipulation on unknown objects with tactile-based reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9234–9241. IEEE, 2024

[78] [79]

Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990

[79] [80]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

[80] [81]

ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI

Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. arXiv preprint arXiv:2410.00425, 2024
