pith. sign in

arxiv: 2605.24975 · v1 · pith:Y33JFOVLnew · submitted 2026-05-24 · 💻 cs.RO · cs.AI· cs.LG

Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

Pith reviewed 2026-06-30 00:51 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.LG
keywords Soft Actor-CriticProximal Policy Optimizationlegged locomotionreinforcement learningmassively parallel trainingsim-to-real transferrobot control
0
0 comments X

The pith

Targeted fixes to policy initialization, critic targets, and return estimation let SAC match PPO in parallel legged robot training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that Soft Actor-Critic can reach the same performance level as Proximal Policy Optimization when training legged robots in large-scale parallel simulations. PPO has been the standard choice because it scales reliably, yet its on-policy design prevents reuse of past data and limits continuous adaptation on real hardware. SAC reuses experience and supports online fine-tuning, but it had consistently fallen short in these massively parallel settings. The authors trace the shortfall to three specific issues and show that correcting policy initialization, making critic targets aware of timeouts, and using multi-step return estimation removes the gap entirely.

Core claim

Modifications covering policy initialization, timeout-aware critic targets, and multi-step return estimation enable SAC to train stably at scale in massively parallel environments. When tested across multiple legged robot platforms and diverse locomotion tasks, the adjusted SAC closes the performance gap with PPO entirely.

What carries the argument

The three targeted modifications—policy initialization, timeout-aware critic targets, and multi-step return estimation—that together stabilize off-policy SAC training in massively parallel simulation for legged locomotion.

If this is right

  • The same algorithm can now handle both large-scale simulation pre-training and continuous online adaptation on real robots.
  • Sample reuse becomes available for legged locomotion without sacrificing final task performance.
  • Training stability at scale is achieved for off-policy methods in environments previously dominated by on-policy algorithms.
  • The approach applies across different robot morphologies and locomotion objectives without platform-specific retuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sim-to-real pipelines could run the identical learner from simulation through hardware deployment, removing the need to switch algorithms mid-process.
  • The performance difference between on-policy and off-policy methods in parallel settings appears driven by implementation details rather than a fundamental limitation.
  • Similar adjustments to initialization and return estimation may improve other off-policy algorithms when scaled to thousands of parallel environments.

Load-bearing premise

The three identified issues are the primary reasons SAC previously underperformed PPO, and the fixes resolve them without introducing new errors that would prevent equivalent results.

What would settle it

Training the modified SAC in the same IsaacLab parallel setup on the reported robot platforms and tasks and checking whether reward curves and final locomotion performance match those of PPO without extra per-task tuning.

Figures

Figures reproduced from arXiv: 2605.24975 by Chenhao Li, Gianluca Sabatini, Marco Hutter.

Figure 1
Figure 1. Figure 1: Overview of our RSL-RL-SAC framework for training legged locomotion policies in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effect of initial standard deviation σ0 on the squashed Gaussian exploration distribution. Large σ0 leads to near-uniform exploration over the action range; small σ0 concentrates exploration around the default joint configuration. the environment processes policy outputs as a p t = s ⊙ at + b e , where b e is the default joint config￾uration and s is the task-specific action scale. For each joint j, we com… view at source ↗
Figure 3
Figure 3. Figure 3: Example of IsaacLab simulation environments used for evaluation: ANYmal D and Uni [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Anymal D velocity tracking on rough terrain. Left: reward vs. training iterations. Right: [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Unitree A1 velocity tracking on rough terrain. Left: reward vs. training iterations. Right: [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Unitree H1 velocity tracking on rough terrain. Left: reward vs. training iterations. Right: [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Unitree G1 velocity tracking on rough terrain. Left: reward vs. training iterations. Right: [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
read the original abstract

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO's empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript identifies three root causes for SAC underperforming PPO in massively parallel IsaacLab settings for legged locomotion—policy initialization, timeout-aware critic targets, and multi-step return estimation—and introduces targeted modifications to SAC. It claims these changes enable stable training at scale and close the performance gap with PPO entirely, as demonstrated across multiple legged robot platforms and diverse locomotion tasks.

Significance. If the empirical results and ablations hold, the work would be significant for robotics by making off-policy methods like SAC viable for high-performance parallel simulation training of legged robots. This could enable sample-efficient workflows, continuous adaptation, and unified sim-to-real pipelines that are incompatible with on-policy PPO.

major comments (2)
  1. [Abstract] Abstract: The central claim that the modifications 'close the performance gap with PPO entirely' is unsupported by any quantitative results, ablation studies, error bars, or statistical comparisons. This is load-bearing for the headline assertion that the three identified factors are the primary causes and that the fixes resolve them without compensating errors or platform-specific tuning.
  2. [Methods/Results] Methods/Results (implied by abstract framing): No details are supplied on how the root causes were isolated (e.g., controlled experiments isolating initialization vs. timeout handling vs. multi-step returns), nor on whether ablations confirm each component is necessary versus other implementation factors such as reward scaling or parallel environment handling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support. We agree that the claims require additional quantitative backing, ablation studies, and details on root-cause isolation. We will revise the manuscript to incorporate these elements while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the modifications 'close the performance gap with PPO entirely' is unsupported by any quantitative results, ablation studies, error bars, or statistical comparisons. This is load-bearing for the headline assertion that the three identified factors are the primary causes and that the fixes resolve them without compensating errors or platform-specific tuning.

    Authors: We acknowledge that the abstract's strong claim must be supported by explicit quantitative evidence. The manuscript reports performance matching across multiple platforms and tasks, but we agree that error bars, statistical tests, and direct quantitative comparisons are needed to substantiate 'entirely' closing the gap without platform-specific effects. In revision we will add multi-seed results with standard deviations, statistical significance metrics, and expanded tables/figures quantifying the performance equivalence. revision: yes

  2. Referee: [Methods/Results] Methods/Results (implied by abstract framing): No details are supplied on how the root causes were isolated (e.g., controlled experiments isolating initialization vs. timeout handling vs. multi-step returns), nor on whether ablations confirm each component is necessary versus other implementation factors such as reward scaling or parallel environment handling.

    Authors: The root causes were identified through systematic debugging in the massively parallel setting, but the manuscript does not present the controlled isolation experiments or full ablation suite. We will add a new subsection with targeted ablations that disable each modification individually while holding other factors fixed, plus controls for reward scaling and environment parallelism. This will confirm necessity and rule out compensating errors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical identification and fixes with external benchmarks

full rationale

The paper is framed as an empirical study that identifies root causes for SAC underperformance in massively parallel settings and introduces targeted modifications (policy initialization, timeout-aware critic targets, multi-step returns). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims rest on experimental evaluation across multiple legged robot platforms and tasks, which are external benchmarks. This is self-contained against those benchmarks with no reduction of any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review conducted on abstract only; full paper text and experimental details unavailable. No free parameters, axioms, or invented entities are explicitly stated in the abstract. The central claim rests on the unstated assumption that the simulation environments and tasks used are representative of real legged locomotion challenges.

axioms (1)
  • domain assumption Massively parallel simulation environments like IsaacLab are representative of the conditions under which legged locomotion policies must be trained and transferred.
    The abstract frames all results around these environments without discussing their limitations or validation against real hardware.

pith-pipeline@v0.9.1-grok · 5707 in / 1348 out tokens · 38765 ms · 2026-06-30T00:51:10.927625+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Marc G Bellemare, Will Dabney, and R ´emi Munos

    URLhttps://arxiv.org/abs/2602.20220. Marc G Bellemare, Will Dabney, and R ´emi Munos. A distributional perspective on reinforcement learning. InInternational Conference on Machine Learning,

  2. [2]

    Exploration by Random Network Distillation

    URLhttps://arxiv.org/abs/1810.12894. Xuxin Cheng, Kexin Shi, Ananye Agarwal, and Deepak Pathak. Extreme parkour with legged robots. In2024 IEEE International Conference on Robotics and Automation (ICRA),

  3. [3]

    Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811,

    Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811,

  4. [4]

    Soft Actor-Critic Algorithms and Applications

    Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and appli- cations.arXiv preprint arXiv:1812.05905,

  5. [5]

    org/abs/2504.16680

    URLhttps://arxiv. org/abs/2504.16680. Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, and Pulkit Agrawal. Parallelq-learning: Scaling off-policy reinforcement learning under massively parallel simulation. InInternational Conference on Machine Learning, pp. 19440–19459. PMLR,

  6. [6]

    Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter

    doi: 10.1109/LRA.2023.3270034. Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, and Marco Hutter. Symmetry con- siderations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 7433–7439. IEEE,

  7. [7]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    URLhttps://arxiv.org/abs/2511.04831. Antonin Raffin. Getting sac to work on a massive parallel simulator: An rl journey with off-policy algorithms.araffin.github.io, Feb

  8. [8]

    Proximal Policy Optimization Algorithms

    URLhttps://proceedings.mlr.press/v164/rudin22a.html. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

  9. [9]

    Schwarke, M

    Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771,

  10. [10]

    Learning sim-to-real humanoid locomotion in 15 minutes, 2025a

    Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025a. URLhttps://arxiv. org/abs/2512.01996. Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid co...

  11. [11]

    com/p/speeding-up-sac-with-massively-parallel

    URLhttps://arthshukla.substack. com/p/speeding-up-sac-with-massively-parallel. Ritvik Singh, Arthur Allshire, Ankur Handa, Nathan Ratliff, and Karl Van Wyk. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands.arXiv preprint arXiv:2412.01791,

  12. [12]

    Laura Smith, Ilya Kostrikov, and Sergey Levine

    URLhttps: //arxiv.org/abs/2110.05457. Laura Smith, Ilya Kostrikov, and Sergey Levine. A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. InRobotics: Science and Systems,

  13. [13]

    MuJoCo Playground,

    Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire, Erik Frey, Koushil Sreenath, Lueder A Kahrs, et al. Mujoco playground.arXiv preprint arXiv:2502.08844,

  14. [14]

    Humanoid parkour learning,

    Ziwen Zhuang, Shenzhe Yao, and Hang Zhao. Humanoid parkour learning.arXiv preprint arXiv:2406.10759,