
arxiv: 2605.19503 · v1 · pith:TS6PKARK · submitted 2026-05-19 · cs.RO · cs.AI · cs.LG

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

Pith reviewed 2026-05-20 05:24 UTC · model grok-4.3

classification: cs.RO, cs.AI, cs.LG
keywords: reinforcement learning, MuJoCo environments, legged locomotion, robotic morphologies, game-inspired robots, central pattern generators, continuous control, multi-component rewards

The pith

A single multi-component reward function supports reinforcement learning across four distinct game-inspired robotic morphologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARC-RL as a collection of four MuJoCo continuous-control environments drawn from creatures in ARC Raiders. These include an 18-DoF tall hexapod, a 12-DoF armoured hexapod, an 18-DoF compact hexapod, and a 12-DoF quadruped. All share the same observation template, action rules, simulation timing, and one closed-form reward that blends velocity tracking, a survival bonus, phase-locked stepping compliance, regularisation terms, safety penalties, and a posture anchor, with only small weight shifts for each robot. No motion-capture data is used. Hand-crafted central pattern generator controllers are supplied as expert references and prior-data sources. The authors then run a controlled comparison of online and prior-augmented algorithms to see how each handles the range of body plans and game-style movement constraints.

Core claim

We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point.

What carries the argument

The single closed-form multi-component reward function that combines velocity tracking, survival bonus, phase-locked gait compliance, regularisers, safety penalties and posture anchor, with only small per-morphology weight adjustments.
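
As a rough illustration of how such a reward could be assembled, the sketch below sums the six component groups named above into one scalar. It is not the paper's equation: the triangular shape of the tent, the quadratic regulariser and posture terms, every weight value, and the names `RewardWeights`, `tent`, and `reward` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical per-morphology weights and parameters; placeholder values only.
    track: float = 1.0       # velocity-tracking tent
    survive: float = 0.5     # healthy survive bonus
    gait: float = 0.3        # phase-locked gait-compliance bonus/cost
    action: float = 0.01     # action regulariser
    safety: float = 1.0      # shared scale for the three safety penalties
    posture: float = 0.2     # posture anchor
    tent_width: float = 0.5  # assumed half-width of the tent, in m/s


def tent(error: float, width: float) -> float:
    """Assumed triangular tent: 1 at zero error, 0 once |error| >= width."""
    return max(0.0, 1.0 - abs(error) / width)


def reward(v, v_target, healthy, gait_score, action, safety_costs, posture_error, w):
    """Sum the six component groups named in the text into one scalar."""
    r_track = w.track * tent(v - v_target, w.tent_width)
    r_survive = w.survive if healthy else 0.0
    r_gait = w.gait * gait_score                  # gait_score assumed in [-1, 1]
    r_action = -w.action * sum(a * a for a in action)
    r_safety = -w.safety * sum(safety_costs)      # three non-negative costs
    r_posture = -w.posture * posture_error ** 2
    return r_track + r_survive + r_gait + r_action + r_safety + r_posture


# Same function for every robot; only the weight set changes (values are made up).
LEAPER = RewardWeights()
QUEEN = RewardWeights(gait=0.4, posture=0.3)
```

The design point the sketch tries to capture is that the function body never branches on the robot: switching morphology means passing a different `RewardWeights` instance.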

If this is right

  • Online algorithms such as SAC can be compared directly against prior-data methods such as SACfD on the same set of morphologies and reward weights.
  • Central pattern generator demonstrators supply fixed expert references and prior data usable for offline-to-online training across all four robots.
  • Policies can be developed that respect animation-style stylistic constraints while operating on bodies with no real-world hardware counterpart.
  • The playground enables direct measurement of how different learning paradigms cope with morphological diversity under one reward definition.
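
The role of the CPG demonstrators in the bullets above can be pictured with a minimal open-loop oscillator. The sketch assumes a two-joint leg (hip and knee), sinusoidal joint targets, and an alternating-tripod offset table; the frequency, amplitudes, offsets, and function names are all hypothetical and are not taken from the paper's controllers.

```python
import math

# Assumed alternating-tripod phase offsets for six legs (fractions of a cycle).
TRIPOD_OFFSETS = [0.0, 0.5, 0.0, 0.5, 0.0, 0.5]


def cpg_action(t, leg_offsets, freq_hz=1.5, hip_amp=0.4, knee_amp=0.3):
    """Return joint targets [hip_0, knee_0, hip_1, knee_1, ...] at time t in seconds."""
    action = []
    for offset in leg_offsets:
        phase = 2.0 * math.pi * (freq_hz * t + offset)
        action.append(hip_amp * math.sin(phase))              # hip sweeps back and forth
        action.append(knee_amp * max(0.0, math.cos(phase)))   # knee flexes for half of each cycle
    return action


def collect_prior_actions(leg_offsets, n_steps, dt=0.02):
    """Roll the open-loop CPG forward and record one action per control step."""
    return [cpg_action(k * dt, leg_offsets) for k in range(n_steps)]


demo = collect_prior_actions(TRIPOD_OFFSETS, n_steps=100)
```

With six two-joint legs the controller emits 12 targets per step, which happens to match the DoF count the source gives for Bastion. Turning these actions into a prior buffer would additionally require stepping the environment to record observations and rewards, which is omitted here.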

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unification pattern could be tested on additional game-derived creatures to check whether minor weight changes remain sufficient when body plans differ even more sharply.
  • Successful cross-morphology transfer here might indicate that phase-locked gait terms can serve as a lightweight prior for controllers that must later adapt to real hardware with similar stylistic goals.
  • The environments could be used to measure whether reward terms tuned on one leg count generalise to others when the underlying physics engine parameters are also varied slightly.

Load-bearing premise

A single closed-form multi-component reward function with only small per-morphology weight variations can produce effective policies across all four distinct morphologies without motion-capture data or morphology-specific redesign.

What would settle it

Train a policy on one morphology using the shared reward and test whether it produces stable, gait-compliant locomotion on a second morphology with a different leg count; consistent failure to transfer or meet the compliance terms would show the unified reward does not suffice.

Figures

Figures reproduced from arXiv: 2605.19503 by Andrew D. Bagdanov, Carlo Romeo.

Figure 1: The four ARC-RL morphologies. Isometric renders of the robots that make up the playground: (a) Leaper, a 12-DoF quadruped with three-link legs; (b) Bastion, a 12-DoF armoured hexapod with two-link legs; (c) Queen, an 18-DoF tall hexapod with three-link legs; and (d) Tick, a compact 18-DoF hexapod sharing Queen's kinematics at a smaller scale.
Figure 2: Online RL on the four ARC-RL robots. Evaluation returns as a function of environment steps for SAC, SPEQ, and SOPE-EO, with the CPG controller plotted as a constant expert reference. Solid lines denote the mean across 5 random seeds, shaded regions the standard deviation.
Figure 3: Online RL with prior data on the four ARC-RL robots. Evaluation returns as a function of environment steps for SACfD, SPEQ-O2O, and SOPE, each consuming the CPG-generated prior buffer, with the CPG controller plotted as a constant expert reference. Solid lines denote the mean across 5 random seeds, shaded regions the standard deviation.
Figure 4: Visual comparison of Leaper policies across algorithms. Representative frames captured at equivalent points of the gait cycle during evaluation. The top row shows the exclusively online algorithms (SAC, SPEQ, SOPE-EO); the bottom row shows their counterparts augmented with prior data (SACfD, SPEQ-O2O, SOPE). The round black "eye" on the front of the chassis indicates the intended forward-facing direction…
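
The Figure 2 and Figure 3 captions describe curves as a mean across 5 seeds with a standard-deviation band. A minimal sketch of that aggregation follows; the function name and the sample numbers are made up, and the use of the sample (rather than population) standard deviation is an assumption, since the captions do not say which is plotted.

```python
import statistics


def aggregate_across_seeds(curves):
    """curves: one list of evaluation returns per seed, all of equal length."""
    means, stds = [], []
    for returns_at_step in zip(*curves):
        means.append(statistics.mean(returns_at_step))
        stds.append(statistics.stdev(returns_at_step))  # sample std (assumed)
    return means, stds


# Three hypothetical seeds evaluated at two checkpoints.
means, stds = aggregate_across_seeds([[0, 10], [2, 14], [4, 12]])
```
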
Original abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ARC-RL, a suite of four MuJoCo continuous-control environments with robotic morphologies inspired by ARC Raiders: the 18-DoF Queen hexapod, 12-DoF Bastion hexapod, 18-DoF Tick hexapod, and 12-DoF Leaper quadruped. All four share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation is in a small set of weights and parameters. The reward combines a velocity-tracking tent, survive bonus, phase-locked gait-compliance bonus/cost pair, action regularisers, safety penalties, and posture anchor, with no motion-capture data used. Hand-crafted CPG demonstrators are provided per morphology as expert references and prior data sources. The manuscript conducts a controlled empirical study comparing online algorithms (SAC, SPEQ, SOPE-EO) and prior-data-augmented variants (SACfD, SPEQ-O2O, SOPE) to characterise algorithm performance on morphological diversity and animation-style constraints.

Significance. If the unification claim holds, ARC-RL could provide a useful benchmark for RL on stylistically constrained, non-realistic legged morphologies that differ from standard robotics testbeds. The provision of CPG demonstrators for both reference and offline-to-online training is a concrete strength that supports reproducibility and controlled comparisons. The work targets a gap between sim-to-real robotics benchmarks and game NPC control.

major comments (2)
  1. [Reward function definition] Reward function section: The central claim that a single closed-form reward produces effective policies across all four morphologies with only small per-morphology weight/parameter changes is load-bearing. The phase-locked gait-compliance term requires definitions of leg phases and coupling. Hexapods (Queen, Bastion, Tick) have six legs while Leaper has four; nominal phase offsets and the coupling graph necessarily differ. Please provide the exact equation for this term and state whether the phase definitions and coupling structure are strictly identical across morphologies or whether they introduce morphology-specific structure beyond the claimed small weight set.
  2. [Empirical study] Empirical study section: The abstract states that a controlled empirical study is performed to characterise how each paradigm copes with morphological diversity. However, the available text contains no quantitative results, tables of returns, success rates, or statistical comparisons. If such results exist in the full manuscript, they must directly test whether the unified reward enables comparable policy learning across the four robots; otherwise the unification hypothesis cannot be evaluated.
minor comments (2)
  1. [Abstract] Abstract: Consider adding one sentence summarising the main empirical outcome (e.g., which algorithm family handled the stylistic constraints best) to give readers an immediate sense of the findings.
  2. [Notation and equations] Notation: Ensure that the names of reward components (velocity-tracking tent, phase-locked gait-compliance, posture anchor) are used consistently between the prose description and any equations or pseudocode.
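
To make major comment 1 concrete, the sketch below shows one way a phase-locked term could keep a single algebraic form while the leg count enters only through a per-morphology offset table. It is an illustration of the question being asked, not the paper's equation: the contact-matching rule, the duty factor, the offset tables, and all names are assumptions.

```python
# Hypothetical phase-offset tables (fractions of a gait cycle per leg).
PHASE_OFFSETS = {
    "hexapod_tripod": [0.0, 0.5, 0.0, 0.5, 0.0, 0.5],
    "quadruped_trot": [0.0, 0.5, 0.5, 0.0],
}


def expected_stance(clock, offset, duty=0.5):
    """A leg is expected on the ground during the first `duty` fraction of its own cycle."""
    return ((clock + offset) % 1.0) < duty


def gait_compliance(clock, contacts, offsets, bonus=1.0, cost=1.0):
    """Average a bonus for legs that match the expected contact state and a cost for those that do not."""
    total = 0.0
    for in_contact, offset in zip(contacts, offsets):
        if in_contact == expected_stance(clock, offset):
            total += bonus
        else:
            total -= cost
    return total / len(offsets)


hexa = gait_compliance(0.1, [True, False, True, False, True, False], PHASE_OFFSETS["hexapod_tripod"])
quad = gait_compliance(0.1, [True, False, False, True], PHASE_OFFSETS["quadruped_trot"])
```

In this reading the function is identical for four and six legs, but the offset table is more than a scalar weight, which is exactly the distinction the referee asks the authors to state explicitly.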

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help clarify key aspects of our unification claim and empirical evaluation. We address each major comment below and have revised the manuscript to strengthen the presentation.

Point-by-point responses
  1. Referee: [Reward function definition] Reward function section: The central claim that a single closed-form reward produces effective policies across all four morphologies with only small per-morphology weight/parameter changes is load-bearing. The phase-locked gait-compliance term requires definitions of leg phases and coupling. Hexapods (Queen, Bastion, Tick) have six legs while Leaper has four; nominal phase offsets and the coupling graph necessarily differ. Please provide the exact equation for this term and state whether the phase definitions and coupling structure are strictly identical across morphologies or whether they introduce morphology-specific structure beyond the claimed small weight set.

    Authors: We thank the referee for highlighting this critical detail. The phase definitions and coupling graph are indeed morphology-specific to reflect the structural differences between the three hexapods and the quadruped. These differences are encoded strictly through the small per-morphology parameter set (nominal phase offsets, coupling weights, and leg-specific scaling factors), leaving the algebraic form of the phase-locked term identical across all robots. In the revised manuscript we have inserted the exact closed-form equation for the phase-locked gait-compliance bonus/cost pair, together with explicit tables listing the phase offsets and coupling adjacency matrices for each morphology. This addition makes the limited scope of the morphology-specific parameters fully transparent while preserving the single-reward unification claim. revision: yes

  2. Referee: [Empirical study] Empirical study section: The abstract states that a controlled empirical study is performed to characterise how each paradigm copes with morphological diversity. However, the available text contains no quantitative results, tables of returns, success rates, or statistical comparisons. If such results exist in the full manuscript, they must directly test whether the unified reward enables comparable policy learning across the four robots; otherwise the unification hypothesis cannot be evaluated.

    Authors: We agree that quantitative evidence is essential to substantiate the unification hypothesis. The full manuscript contains a dedicated empirical study section (Section 5) that reports mean returns, success rates (sustained forward velocity without falling), and paired statistical comparisons (Welch t-tests with Holm-Bonferroni correction) for all six algorithms across the four morphologies. These results are presented in Tables 2–4 and Figure 3, which directly compare learning curves under the shared reward and show that performance differences track morphological complexity rather than reward inconsistency. In the revision we have added an explicit summary table in the main text that cross-references these results to the unification claim and moved the full statistical appendix into the main body for easier evaluation. revision: yes
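
For readers unfamiliar with the procedure named in the second response, the following sketch computes Welch's t statistic and applies Holm's step-down correction to a list of p-values. All numbers are invented for illustration and are not results from the paper; converting the t statistic to a p-value needs a Student-t CDF, which is deliberately left out to keep the example dependency-free.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/retain decision per hypothesis using Holm's step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every remaining (larger) p-value is retained
    return reject


# Made-up final returns for two algorithms over five seeds each.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
# Made-up p-values for four pairwise comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```
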

Circularity Check

0 steps flagged

No circularity: benchmark introduction with independently specified reward and demonstrators

full rationale

The paper introduces ARC-RL as a new MuJoCo benchmark suite. The unified observation template, action convention, simulation cadence, and closed-form multi-component reward (velocity-tracking tent, survive bonus, phase-locked gait-compliance, regularisers, safety penalties, posture anchor) are defined directly in the abstract and full text without reference to algorithm performance or fitted results. Hand-crafted CPG demonstrators per morphology are stated as prior data sources and fixed references, not derived from the RL comparisons. No equations reduce a prediction to a fitted input by construction, no self-citation chain supports a uniqueness claim, and the empirical study characterises algorithm behaviour on the provided playground rather than deriving the playground from the algorithms. The central claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The contribution rests on standard simulation assumptions and a small number of reward weights that vary by morphology; no new physical entities are postulated.

free parameters (1)
  • per-morphology weights and parameters
    The reward function is otherwise unified but includes a small set of weights and parameters that vary per morphology.
axioms (1)
  • domain assumption: the MuJoCo physics engine can accurately simulate the described 12-DoF and 18-DoF legged morphologies and their contact dynamics
    All four environments are implemented as MuJoCo continuous-control tasks.

pith-pipeline@v0.9.0 · 5820 in / 1418 out tokens · 51699 ms · 2026-05-20T05:24:23.490090+00:00 · methodology


