arxiv: 2605.19503 · v2 · pith:TS6PKARKnew · submitted 2026-05-19 · 💻 cs.RO · cs.AI · cs.LG

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

Pith reviewed 2026-05-21 07:32 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords reinforcement learning · legged locomotion · MuJoCo · game-inspired robots · morphological diversity · central pattern generators · offline-to-online · stylistic constraints

The pith

ARC-RL introduces four MuJoCo environments with game-inspired morphologies and a unified reward to compare RL algorithms under stylistic constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARC-RL, a suite of four MuJoCo continuous-control environments whose morphologies are drawn from the ARC Raiders game bestiary rather than commercial robot hardware. The Queen, Bastion, Tick, and Leaper share one observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only morphology-specific elements are a small set of weights. That reward combines velocity tracking, a survival bonus, phase-locked gait terms, action regularizers, safety penalties, and a posture anchor, with no motion-capture data used at any stage. Hand-crafted Central Pattern Generator demonstrators are supplied as fixed expert references and as sources of prior data. The authors then run a controlled empirical comparison of standard online algorithms against prior-data-augmented variants to characterize how each paradigm handles the playground's morphological diversity and animation-style constraints.

Core claim

ARC-RL supplies four legged morphologies (the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper) together with a unified reward that fuses velocity-tracking, survival, gait-compliance, regularization, safety, and posture terms. The only per-morphology variation is a handful of scalar weights. Central Pattern Generator models provide both fixed expert trajectories and prior data for offline-to-online training. A controlled study then compares purely online methods (SAC, SPEQ, SOPE-EO) with prior-augmented counterparts (SACfD, SPEQ-O2O, SOPE) to show how each class copes with the morphological variety and the game-animation style constraints.
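
One common way to consume demonstrator data in offline-to-online training is to draw each minibatch partly from a fixed prior buffer and partly from the online replay buffer. The sketch below shows that generic mixing scheme only. The 50/50 split, the function name `sample_mixed_batch`, and the row-per-transition array layout are assumptions for illustration; this is not claimed to be the sampling rule used by SACfD, SPEQ-O2O, or SOPE in the paper.

```python
import numpy as np


def sample_mixed_batch(prior, online, batch_size, prior_fraction=0.5, rng=None):
    """Draw a minibatch mixing prior (demonstrator) and online transitions.

    `prior` and `online` are 2-D arrays with one transition per row.
    The mixing ratio is a hypothetical default, not a value from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_prior = int(round(batch_size * prior_fraction))
    if len(online) == 0:
        # Before any environment interaction, fall back to prior data only.
        n_prior = batch_size
    n_online = batch_size - n_prior

    # Sample with replacement from each buffer.
    parts = [prior[rng.integers(0, len(prior), size=n_prior)]]
    if n_online > 0:
        parts.append(online[rng.integers(0, len(online), size=n_online)])
    return np.concatenate(parts, axis=0)
```

Keeping the prior buffer separate, rather than merging it into the online buffer, means the share of demonstrator data per update stays fixed as online experience accumulates. Whether that is desirable depends on the algorithm.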

What carries the argument

The single closed-form multi-component reward function shared across all morphologies, varying only by a small set of weights, together with the unified observation template, action convention, and simulation cadence.
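
As a rough illustration of what a shared reward with per-morphology weights could look like, the sketch below combines the component families named in the abstract. Every functional form, argument name, weight name, and default value here is an assumption made for illustration; the paper's actual closed-form expressions are not reproduced on this page.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical per-morphology weights (names and values are illustrative)."""
    velocity: float = 1.0
    survive: float = 0.5
    gait: float = 0.3
    action: float = 0.01
    safety: float = 1.0
    posture: float = 0.2


def tent(x, target, width):
    """Triangular kernel: 1 at the target, falling linearly to 0 at +/- width."""
    return max(0.0, 1.0 - abs(x - target) / width)


def shared_reward(w, forward_vel, target_vel, healthy, gait_match, action,
                  safety_violations, height, nominal_height, tent_width=1.0):
    """Sum the weighted components and return (total, per-component dict).

    `gait_match` is assumed to lie in [-1, 1], standing in for the
    bonus/cost pair; `safety_violations` is an assumed violation count.
    """
    parts = {
        "velocity": w.velocity * tent(forward_vel, target_vel, tent_width),
        "survive": w.survive if healthy else 0.0,
        "gait": w.gait * gait_match,
        "action": -w.action * float(np.sum(np.square(action))),
        "safety": -w.safety * float(safety_violations),
        "posture": -w.posture * (height - nominal_height) ** 2,
    }
    return sum(parts.values()), parts
```

Under this layout, adapting the reward to a new body would mean constructing another `RewardWeights` instance and leaving `shared_reward` untouched.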

If this is right

  • Online and prior-augmented algorithms can be compared directly on identical environments that differ only in body plan and a few reward weights.
  • The morphological spread allows direct measurement of how well each training paradigm scales to changes in degrees of freedom and mass distribution.
  • Stylistic constraints can be studied in isolation from real-robot hardware constraints or motion-capture requirements.
  • Reproducible baselines become available for any method that seeks to produce locomotion satisfying game-animation aesthetics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Controllers developed here could transfer to game-NPC animation pipelines where visual style matters more than energetic efficiency.
  • The same unified-reward template could be applied to additional game or animation morphologies to test whether the approach generalizes beyond the four presented bodies.
  • If the CPG priors prove effective, similar hand-crafted oscillators might reduce sample complexity in other high-DoF continuous-control tasks.

Load-bearing premise

The game-inspired morphologies together with the single closed-form reward function are sufficient to represent stylistic constraints absent from sim-to-real robotics without motion-capture data or additional task-specific terms.

What would settle it

Training the listed algorithms on the four environments and checking whether the phase-locked gait-compliance scores remain consistent with the supplied CPG demonstrators across morphologies when no per-morphology reward terms are added.
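
A check of this kind needs some way to score measured foot contacts against a phase clock. The sketch below is one hypothetical scoring rule for a six-legged body: each leg gets a phase offset, is expected to be in stance for the first part of its cycle, and the score is the fraction of legs whose contact flag matches. The tripod offsets, the 0.5 duty factor, and all function names are assumptions; the paper's own gait-compliance term and CPG parameters are not given on this page.

```python
import numpy as np

# Hypothetical tripod gait: alternating legs are half a cycle apart.
TRIPOD_OFFSETS = np.array([0.0, 0.5, 0.0, 0.5, 0.0, 0.5])


def leg_phases(t, freq_hz, offsets=TRIPOD_OFFSETS):
    """Phase of each leg in [0, 1) at time t (seconds)."""
    return (freq_hz * t + offsets) % 1.0


def expected_stance(phases, duty=0.5):
    """A leg is expected on the ground during the first `duty` of its cycle."""
    return phases < duty


def gait_compliance(contacts, phases, duty=0.5):
    """Fraction of legs whose measured contact matches the expected stance."""
    contacts = np.asarray(contacts, dtype=bool)
    return float(np.mean(contacts == expected_stance(phases, duty)))
```

Averaging `gait_compliance` over an evaluation episode, for both a learned policy and the demonstrator, would give two comparable numbers per morphology.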

Figures

Figures reproduced from arXiv: 2605.19503 by Andrew D. Bagdanov, Carlo Romeo.

Figure 1: The four ARC-RL morphologies. Isometric renders of the robots that make up the playground: (a) Leaper, a 12-DoF quadruped with three-link legs; (b) Bastion, a 12-DoF armoured hexapod with two-link legs; (c) Queen, an 18-DoF tall hexapod with three-link legs; and (d) Tick, a compact 18-DoF hexapod sharing Queen’s kinematics at a smaller scale.
Figure 2: Online RL on the four ARC-RL robots. Evaluation returns as a function of environment steps for SAC, SPEQ, and SOPE-EO, with the CPG controller plotted as a constant expert reference. Solid lines denote the mean across 5 random seeds, shaded regions the standard deviation.
Figure 3: Online RL with prior data on the four ARC-RL robots. Evaluation returns as a function of environment steps for SACfD, SPEQ-O2O, and SOPE, each consuming the CPG-generated prior buffer, with the CPG controller plotted as a constant expert reference. Solid lines denote the mean across 5 random seeds, shaded regions the standard deviation.
Figure 4: Visual comparison of Leaper policies across algorithms. Representative frames captured at equivalent points of the gait cycle during evaluation. The top row shows the exclusively online algorithms (SAC, SPEQ, SOPE-EO); the bottom row shows their counterparts augmented with prior data (SACfD, SPEQ-O2O, SOPE). The round black “eye” on the front of the chassis indicates the intended forward-facing direction…
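
The captions for Figures 2 and 3 describe each curve as a mean across 5 seeds with a standard-deviation band. A minimal sketch of that aggregation follows, assuming evaluation returns are stored as a (seeds × checkpoints) array. The function name and array layout are hypothetical, and the captions do not say whether the population or sample standard deviation is used; the population form is assumed here.

```python
import numpy as np


def aggregate_seeds(returns):
    """Return (mean, lower, upper) curves from a (seeds, checkpoints) array."""
    returns = np.asarray(returns, dtype=float)
    if returns.ndim != 2:
        raise ValueError("expected a 2-D array of shape (seeds, checkpoints)")
    mean = returns.mean(axis=0)
    std = returns.std(axis=0)  # population std (ddof=0): an assumption
    return mean, mean - std, mean + std
```

The `lower` and `upper` arrays can be passed directly to a plotting call such as Matplotlib's `fill_between` to draw the shaded region.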
Original abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ARC-RL, a reinforcement learning playground consisting of four MuJoCo continuous-control environments with morphologies inspired by ARC Raiders game creatures: the 18-DoF Queen hexapod, 12-DoF Bastion armoured hexapod, 18-DoF Tick compact hexapod, and 12-DoF Leaper quadruped. All share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function (velocity-tracking tent, survive bonus, phase-locked gait-compliance, action regularisers, safety penalties, posture anchor) whose only per-morphology variation is a small set of weights. Hand-crafted CPG demonstrators are supplied per morphology as expert references and prior-data sources. The work conducts a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) against prior-data-augmented variants (SACfD, SPEQ-O2O, SOPE) to characterise performance under morphological diversity and animation-style stylistic constraints. Code is released publicly.

Significance. If the empirical characterisation holds, ARC-RL supplies a reproducible, open-source benchmark for RL on non-standard legged morphologies that lack commercial hardware analogues. The unified reward without motion-capture data and the inclusion of both online and offline-to-online paradigms allow systematic study of how algorithms handle body-plan variation. Public code release is a clear strength that supports follow-on work in game-inspired robotics controllers.

major comments (1)
  1. [Abstract] The claim that the four morphologies plus the unified reward impose 'animation-style stylistic constraints absent from sim-to-real robotics' is load-bearing for the central contribution, yet the reward components (velocity tent, phase-locked gait compliance, regularisers, safety penalties, posture anchor) are standard MuJoCo locomotion terms. No explicit features for game-specific aesthetics (exaggerated limb arcs, creature timing, non-biomechanical postures) are described, and the CPGs are hand-crafted rather than derived from animation data. This leaves open whether the study measures stylistic adaptation or ordinary gait stability; a clarifying ablation or qualitative analysis is needed to support the characterisation.
minor comments (2)
  1. [Abstract] The abstract references a 'controlled empirical study' but does not preview the primary metrics (e.g., success rate, cumulative reward, gait metrics) or report error bars; ensure these appear explicitly in the results section or tables.
  2. Provide a short table or appendix entry listing the exact per-morphology reward weights and CPG parameters so readers can reproduce the stylistic variation without inspecting the released code.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The major comment raises a valid point about the load-bearing claim regarding animation-style constraints. We address it directly below and outline the changes we will make.

Point-by-point responses
  1. Referee: [Abstract] The claim that the four morphologies plus the unified reward impose 'animation-style stylistic constraints absent from sim-to-real robotics' is load-bearing for the central contribution, yet the reward components (velocity tent, phase-locked gait compliance, regularisers, safety penalties, posture anchor) are standard MuJoCo locomotion terms. No explicit features for game-specific aesthetics (exaggerated limb arcs, creature timing, non-biomechanical postures) are described, and the CPGs are hand-crafted rather than derived from animation data. This leaves open whether the study measures stylistic adaptation or ordinary gait stability; a clarifying ablation or qualitative analysis is needed to support the characterisation.

    Authors: We agree that the claim requires stronger support and thank the referee for identifying this gap. The animation-style constraints are intended to arise from two sources that differ from typical sim-to-real setups: (1) the four morphologies are explicitly non-biomechanical game creatures (tall hexapod, armoured hexapod, compact hexapod, and quadruped) with no commercial hardware analogues, and (2) the single closed-form reward applies a phase-locked gait-compliance term uniformly across all morphologies to encourage periodic, stylised locomotion rather than purely energy-efficient or stable gaits. The hand-crafted CPG demonstrators are provided precisely as references for such stylised periodic behaviour. Nevertheless, the manuscript does not currently contain an explicit ablation isolating the gait-compliance term or a qualitative gait analysis, so the distinction from ordinary stability remains implicit. We will therefore add a short qualitative analysis subsection (with example gait visualisations) and an ablation on the phase-locked gait-compliance weight to demonstrate its contribution to stylistic constraints. We will also revise the abstract and introduction to make the morphological and reward-based sources of the constraints explicit rather than relying on the current phrasing.

    revision: yes

Circularity Check

0 steps flagged

No circularity: new environments and reward defined independently of results

The paper introduces four new MuJoCo environments with custom morphologies inspired by game assets, a single closed-form multi-component reward function (velocity tent, survive bonus, phase-locked gait compliance, regularisers, safety penalties, posture anchor) whose per-morphology variation is only in weights, and hand-crafted CPG demonstrators. These are presented as definitions rather than derived quantities. The empirical study then compares standard RL algorithms on these newly specified setups. No load-bearing step reduces a claimed result to a fitted parameter from the same paper, a self-citation chain, or an ansatz smuggled via prior work; the central claims rest on the explicit construction of the playground and the reported experimental outcomes, which remain falsifiable against the released code and environments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The contribution rests on introducing new simulation environments whose physics are taken as given by the MuJoCo engine, a reward function whose weights are chosen per morphology, and standard RL training assumptions; no new physical entities or ungrounded constants are postulated.

free parameters (1)
  • per-morphology reward weights and parameters
    Small set of scalar weights that adapt the shared multi-component reward to each morphology; values are not derived from first principles or external data.
axioms (1)
  • (domain assumption) The MuJoCo physics engine produces sufficiently accurate dynamics for the described robotic morphologies
    All training and evaluation occurs inside MuJoCo; no validation against real hardware is mentioned.
invented entities (1)
  • Queen, Bastion, Tick, and Leaper robotic morphologies (no independent evidence)
    purpose: Provide diverse legged robot designs inspired by game creatures to test RL under stylistic constraints
    New kinematic and morphological specifications introduced in this work; no independent evidence such as real-world measurements or predicted physical properties is supplied.
