arxiv: 2607.00442 · v1 · pith:4EQAZYJInew · submitted 2026-07-01 · 💻 cs.RO · cs.AI

Learning Gait-Aware Quadruped Locomotion with Temporal Logic Specifications

Pith reviewed 2026-07-02 11:56 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords quadruped locomotion · reinforcement learning · signal temporal logic · reward shaping · gait specification · PPO · temporal logic constraints

The pith

Signal Temporal Logic specifications shape rewards that improve quadruped velocity tracking and training stability over hand-crafted baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to specify distinct quadruped gaits through parameterized Signal Temporal Logic constraints that encode safety bounds, gait synchronization, command tracking, and actuation limits. These constraints are turned into dense, continuous rewards via smooth approximations of their robustness measures, which then guide policy training with Proximal Policy Optimization. Parametric templates are defined and calibrated for three speed regimes from reference rollouts, and the resulting rewards are tested on a simulated Barkour quadruped with parallel simulation and domain randomization. The central demonstration is that this approach produces policies with tighter velocity tracking and more stable training than standard hand-crafted reward functions. A sympathetic reader would care because it replaces opaque reward design with explicit, interpretable temporal constraints for controlling complex locomotion behaviors.

Core claim

We introduce a framework where distinct gaits are specified using parameterized constraints expressed in Signal Temporal Logic (STL). These include safety bounds, gait synchronization constraints, command tracking, and actuation bounds. From these specifications, we develop a reward shaping mechanism that provides learning agents a dense, continuous reward landscape that encodes desired behavior. We define parametric STL templates for three speed regimes (walking-trot, trot, bound), calibrate their parameters from reference rollouts, and compute rewards using smooth approximations of STL robustness over the rollouts. The generated rewards can be used to provide shaped gradients compatible with Proximal Policy Optimization (PPO).

What carries the argument

Parameterized STL templates for three speed regimes that generate dense rewards from smooth robustness approximations for PPO training.
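
As a rough illustration of how a smooth robustness approximation can turn a temporal constraint into a dense reward, the sketch below scores a velocity-tracking rule of the form "always |v - v_cmd| <= eps" over one rollout. The log-sum-exp soft minimum, the temperature `beta`, the tolerance `eps`, and all function names are assumptions made for illustration; this page does not state which approximation the paper actually uses.

```python
import math

def soft_min(values, beta=100.0):
    """Smooth under-approximation of min(values) via log-sum-exp.

    Larger beta tracks the true minimum more closely.
    """
    # Shift by the true minimum so the exponentials cannot overflow.
    m = min(values)
    return m - math.log(sum(math.exp(-beta * (v - m)) for v in values)) / beta

def always_tracking_robustness(velocities, v_cmd, eps, beta=100.0):
    """Approximate robustness of G(|v - v_cmd| <= eps) over a rollout."""
    # Per-step margin: positive when the step satisfies the bound.
    margins = [eps - abs(v - v_cmd) for v in velocities]
    return soft_min(margins, beta)

# Toy rollouts (hypothetical numbers): one stays inside the band, one leaves it.
good = always_tracking_robustness([0.95, 1.02, 1.0, 0.98], v_cmd=1.0, eps=0.1)
bad = always_tracking_robustness([0.95, 1.3, 1.0, 0.98], v_cmd=1.0, eps=0.1)
```

Here `good` comes out positive and `bad` negative, and both vary continuously with every velocity sample. The soft minimum can sit below the true minimum by up to log(n)/beta for n samples, so a small `beta` gives smoother gradients at the cost of a more pessimistic score.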

If this is right

  • Policies can be trained with explicit control over gait type through adjustments to the STL parameters.
  • Reward signals derived from temporal logic provide denser gradients that stabilize PPO training.
  • Velocity tracking accuracy improves when rewards encode synchronization and command constraints from STL.
  • Domain randomization combines with the shaped rewards to produce policies robust to variations in simulation.
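
To make "control over gait type through STL parameters" more concrete, here is a hypothetical sketch of a synchronization margin in which the gait is selected by choosing which leg pairs must move together. The foot-height signal, the leg names (FL, FR, RL, RR), the `tol` parameter, and the use of an exact minimum are all assumptions; the paper's actual templates are not reproduced on this page.

```python
# Leg pairs expected to move together in each gait (FL = front-left, etc.).
GAIT_PAIRS = {
    "trot": [("FL", "RR"), ("FR", "RL")],   # diagonal pairs
    "bound": [("FL", "FR"), ("RL", "RR")],  # front pair and rear pair
}

def sync_margin(foot_heights, gait, tol):
    """Worst-case margin of 'paired feet stay within tol of each other'.

    foot_heights: one dict per time step mapping leg name to foot height (m).
    The result is positive only if every pair agrees at every step.
    """
    margins = [
        tol - abs(step[a] - step[b])
        for step in foot_heights
        for a, b in GAIT_PAIRS[gait]
    ]
    return min(margins)

# Toy two-step rollout in which the diagonal pairs alternate swing and stance.
rollout = [
    {"FL": 0.05, "RR": 0.05, "FR": 0.00, "RL": 0.00},
    {"FL": 0.00, "RR": 0.00, "FR": 0.05, "RL": 0.05},
]
trot_score = sync_margin(rollout, "trot", tol=0.02)
bound_score = sync_margin(rollout, "bound", tol=0.02)
```

The same rollout satisfies the trot pairing and violates the bound pairing, which is the sense in which a single parameter choice can switch the gait being rewarded.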

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same STL specifications used for reward shaping could later verify whether a trained policy satisfies the original gait requirements.
  • Calibrating templates directly from real-robot data rather than simulation rollouts might reduce the sim-to-real gap.
  • Similar parametric STL templates could be developed for other periodic robot behaviors such as manipulation or multi-agent coordination.

Load-bearing premise

The parametric STL templates for the three speed regimes, calibrated from reference rollouts, accurately capture desired gait behaviors and produce effective dense rewards.
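
One plausible reading of "calibrated from reference rollouts" is that each template parameter is set to the tightest value the reference data still satisfy, plus some slack. The sketch below does this for a single tracking tolerance. The nearest-rank quantile rule, the `slack` factor, and the function name are assumptions, since this page does not describe the paper's calibration procedure.

```python
import math

def calibrate_tolerance(reference_rollouts, v_cmd, quantile=0.95, slack=1.1):
    """Pick a tracking tolerance that most reference rollouts satisfy."""
    # Worst tracking error within each reference rollout, sorted ascending.
    worst = sorted(max(abs(v - v_cmd) for v in r) for r in reference_rollouts)
    # Nearest-rank quantile over rollouts, clamped to a valid index.
    idx = min(len(worst) - 1, max(0, math.ceil(quantile * len(worst)) - 1))
    # Inflate slightly so the reference data sit strictly inside the bound.
    return slack * worst[idx]

# Toy reference data (hypothetical numbers), commanded speed 1.0 m/s.
reference = [[1.0, 1.05], [0.9, 1.0], [1.0, 1.2]]
eps = calibrate_tolerance(reference, v_cmd=1.0)
```

A calibration of this kind makes the premise testable: if the reference rollouts do not exhibit the intended gait, the fitted bounds will reward the wrong behavior regardless of how the temporal structure is written.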

What would settle it

Training experiments in which STL-shaped rewards produce higher velocity tracking error or less stable learning curves than hand-crafted rewards would falsify the performance advantage.

Figures

Figures reproduced from arXiv: 2607.00442 by Alfredo Reina Corona, Cagan Bakirci, Jyotirmoy V. Deshmukh, Keyan Azbijari, Merve Atasever.

Figure 1: Overall Pipeline.
… behavior whose contact patterns remain appropriate for the commanded regime. State-of-the-art deep RL pipelines address the first objective, and partially the second, through carefully engineered reward terms, curriculum learning, and domain randomization [14, 11, 13]. However, these reward functions are often difficult to interpret, and they provide only indirect control over contact-seque…
Figure 2: Barkour vb robot in MJX.
Formal specifications provide an alternative perspective: rather than expressing locomotion objectives through loosely coupled rewards, desired behaviors can be encoded as logical specifications. Signal Temporal Logic (STL) is a framework that has been used in robotics to express bounded-time constraints over real-valued signals. It admits quantitative or robustness semantics, i.…
Abstract

Reinforcement learning (RL) for quadruped locomotion commonly depends on fixed, hand-crafted, and Markovian reward functions that limit the interpretability of learned policies and lack explicit control over gait behaviors. We introduce a framework where distinct gaits are specified using parameterized constraints expressed in Signal Temporal Logic (STL). These include safety bounds, gait synchronization constraints, command tracking, and actuation bounds. From these specifications, we develop a reward shaping mechanism that provides learning agents a dense, continuous reward landscape that encodes desired behavior. We define parametric STL templates for three speed regimes (walking-trot, trot, bound), calibrate their parameters from reference rollouts, and compute rewards using smooth approximations of STL robustness over the rollouts. The generated rewards can be used to provide shaped gradients compatible with Proximal Policy Optimization (PPO). We instantiate the approach on Google's Barkour quadruped robot in MuJoCo XLA (MJX). We use parallelization within the simulator to speed up training and domain randomization to robustify learned policies. We show that, compared to a baseline of hand-crafted rewards, the STL-shaped rewards yield tighter velocity tracking and more stable training. Videos can be found on our project website: https://stl-locomotion.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a framework for quadruped locomotion in RL that uses parameterized Signal Temporal Logic (STL) templates to encode gait behaviors (safety bounds, synchronization, command tracking) across three speed regimes. Parameters are calibrated from reference rollouts; smooth robustness approximations then produce dense rewards for PPO. The method is instantiated on the Barkour robot in MuJoCo with domain randomization and parallel simulation; the central empirical claim is that STL-shaped rewards produce tighter velocity tracking and more stable training than a hand-crafted reward baseline.

Significance. If the comparison is shown to be non-circular and the quantitative gains are reproducible, the work would demonstrate a practical route to injecting interpretable, temporally structured specifications into reward shaping for legged RL. This could improve policy transparency and gait control without sacrificing sample efficiency, a useful contribution to the intersection of formal methods and robotics.

major comments (2)
  1. [method section on parametric templates] Calibration procedure for STL templates (method section on parametric templates for walking-trot/trot/bound): the paper states that parameters are fitted from reference rollouts, yet provides no description of how those rollouts were generated (e.g., whether they came from policies already optimized for the target velocities or from the same simulator setup used in the experiments). Without an explicit statement that the reference data are independent of both the STL and hand-crafted training loops, the reported advantage in velocity tracking cannot be unambiguously attributed to the temporal-logic structure rather than to the calibration step itself.
  2. [results section] Experimental comparison (results section reporting velocity tracking and training stability): the abstract and claim assert tighter tracking and more stable PPO training, but the provided text supplies no numerical values, standard deviations, number of seeds, or statistical tests. A load-bearing claim of superiority therefore rests on unreported quantitative evidence; the manuscript must include these metrics (e.g., mean tracking error per regime, success rate, or learning curves) to allow verification.
minor comments (2)
  1. [abstract] The abstract mentions "Videos can be found on our project website" but the manuscript does not include a persistent link or DOI; a stable reference should be added.
  2. [reward shaping subsection] Notation for the smooth robustness approximation is introduced without an explicit equation number or reference to the approximation formula used (e.g., the specific sigmoid or log-sum-exp form); adding an equation label would improve reproducibility.
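
For reference, one standard way to write the quantities that minor comment 2 asks about is given below: the usual STL robustness semantics for "always" and "eventually" over a bounded window, followed by a log-sum-exp smoothing of the minimum at temperature β together with its approximation bounds. Whether the paper uses this exact form is not stated on this page, so it should be read as an assumption.

```latex
\rho(\mathbf{G}_{[a,b]}\,\varphi, s, t) = \min_{t' \in [t+a,\,t+b]} \rho(\varphi, s, t'),
\qquad
\rho(\mathbf{F}_{[a,b]}\,\varphi, s, t) = \max_{t' \in [t+a,\,t+b]} \rho(\varphi, s, t')

\widetilde{\min}_{\beta}(x_1,\dots,x_n) = -\frac{1}{\beta}\log\sum_{i=1}^{n} e^{-\beta x_i},
\qquad
\min_i x_i - \frac{\log n}{\beta} \;\le\; \widetilde{\min}_{\beta}(x_1,\dots,x_n) \;\le\; \min_i x_i
```

The bounds show the trade-off a labeled equation would let readers check: the smoothed value never overstates satisfaction, and its gap to the exact robustness shrinks as β grows.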

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and provide supporting evidence.

Point-by-point responses
  1. Referee: [method section on parametric templates] Calibration procedure for STL templates (method section on parametric templates for walking-trot/trot/bound): the paper states that parameters are fitted from reference rollouts, yet provides no description of how those rollouts were generated (e.g., whether they came from policies already optimized for the target velocities or from the same simulator setup used in the experiments). Without an explicit statement that the reference data are independent of both the STL and hand-crafted training loops, the reported advantage in velocity tracking cannot be unambiguously attributed to the temporal-logic structure rather than to the calibration step itself.

    Authors: We agree that the manuscript does not explicitly describe how the reference rollouts were generated. We will revise the method section on parametric templates to state that these rollouts were produced by a preliminary policy trained independently using only a basic velocity-tracking reward (without STL or the hand-crafted baseline) in the same MuJoCo Barkour environment with domain randomization. This addition will establish the independence of the calibration data from the reported training loops. revision: yes

  2. Referee: [results section] Experimental comparison (results section reporting velocity tracking and training stability): the abstract and claim assert tighter tracking and more stable PPO training, but the provided text supplies no numerical values, standard deviations, number of seeds, or statistical tests. A load-bearing claim of superiority therefore rests on unreported quantitative evidence; the manuscript must include these metrics (e.g., mean tracking error per regime, success rate, or learning curves) to allow verification.

    Authors: We acknowledge that the manuscript reports only qualitative improvements without the requested quantitative details. We will revise the results section to include mean velocity tracking errors per regime with standard deviations, the number of random seeds used, success rates, and learning curve comparisons, along with any applicable statistical tests. This will substantiate the claims of tighter tracking and more stable training. revision: yes
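
The metrics promised in this response could be tabulated along the following lines. The sketch computes a mean absolute tracking error per seed, then the mean and sample standard deviation across seeds. The data layout and function names are hypothetical, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
from statistics import mean, stdev

def tracking_error(velocities, v_cmd):
    """Mean absolute velocity tracking error over one evaluation rollout."""
    return mean(abs(v - v_cmd) for v in velocities)

def summarize_over_seeds(rollouts_by_seed, v_cmd):
    """Return (mean, sample std) of the per-seed tracking error."""
    errors = [tracking_error(r, v_cmd) for r in rollouts_by_seed]
    # The sample standard deviation needs at least two seeds.
    spread = stdev(errors) if len(errors) > 1 else 0.0
    return mean(errors), spread

# Placeholder data: two seeds, commanded speed 1.0 m/s.
avg_err, std_err = summarize_over_seeds([[1.0, 1.1], [1.0, 1.3]], v_cmd=1.0)
```

Reporting this pair per speed regime and per reward type, alongside the number of seeds, would let a reader judge whether the tracking gap exceeds seed-to-seed variation.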

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper explicitly states it defines parametric STL templates for three speed regimes, calibrates their parameters from reference rollouts, and computes rewards via smooth robustness approximations for use with PPO. It then reports an empirical comparison showing STL-shaped rewards yield tighter velocity tracking and more stable training than a hand-crafted baseline. This calibration step is presented as part of the method rather than a hidden fit renamed as a prediction, and the performance claim rests on independent training outcomes rather than reducing by construction to the reference data. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The approach is self-contained against the stated external baseline.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only, the approach depends on calibrated parameters from reference rollouts as free parameters and the domain assumption that STL can express the listed gait constraints.

free parameters (1)
  • STL template parameters for gaits = calibrated from reference rollouts
    Parameters for safety bounds, gait synchronization constraints, command tracking, and actuation bounds are calibrated from reference rollouts for three speed regimes.
axioms (1)
  • domain assumption: Signal Temporal Logic can express gait behaviors using parameterized constraints for safety, synchronization, tracking, and actuation
    The framework is built on this assumption to generate rewards from STL specifications.

pith-pipeline@v0.9.1-grok · 5771 in / 1222 out tokens · 41580 ms · 2026-07-02T11:56:37.789792+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1] M. Raibert, K. Blankespoor, G. Nelson, and R. Playter. Bigdog, the rough-terrain quadruped robot. IFAC Proceedings Volumes, 41(2):10822–10825, 2008

  2. [2] B. Katz, J. Di Carlo, and S. Kim. Mini cheetah: A platform for pushing the limits of dynamic quadruped control. In 2019 international conference on robotics and automation (ICRA), pages 6295–6301. IEEE, 2019

  3. [3] X. Peng, E. Coumans, T. Zhang, T. Lee, J. Tan, and S. Levine. Learning agile robotic locomotion skills by imitating animals. arXiv preprint arXiv:2004.00784, 2020

  4. [4] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning quadrupedal locomotion over challenging terrain. Science robotics, 5(47):eabc5986, 2020

  5. [5] Z. Xie, X. Da, M. Van de Panne, B. Babich, and A. Garg. Dynamics randomization revisited: A case study for quadrupedal locomotion. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4955–4961. IEEE, 2021

  6. [6] A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on robot learning, pages 403–415. PMLR, 2023

  7. [7] D. Kim, J. Di Carlo, B. Katz, G. Bledt, and S. Kim. Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control. arXiv preprint arXiv:1909.06586, 2019

  8. [8] Q. Nguyen, M. J. Powell, B. Katz, J. Di Carlo, and S. Kim. Optimized jumping on the mit cheetah 3 robot. In 2019 International Conference on Robotics and Automation (ICRA), pages 7448–7454. IEEE, 2019

  9. [9] J. Di Carlo, P. M. Wensing, B. Katz, G. Bledt, and S. Kim. Dynamic locomotion in the mit cheetah 3 through convex model-predictive control. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 1–9. IEEE, 2018

  10. [10] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  11. [11] M. Hutter, C. Gehring, D. Jud, A. Lauber, C. D. Bellicoso, V. Tsounis, J. Hwangbo, K. Bodie, P. Fankhauser, M. Bloesch, et al. Anymal-a highly mobile and dynamic quadrupedal robot. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 38–44. IEEE, 2016

  12. [12] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

  13. [13] K. Caluwaerts, A. Iscen, J. C. Kew, W. Yu, T. Zhang, D. Freeman, K.-H. Lee, L. Lee, S. Saliceti, V. Zhuang, et al. Barkour: Benchmarking animal-level agility with quadruped robots. arXiv preprint arXiv:2305.14654, 2023

  14. [14] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018

  15. [15] O. Maler and D. Nickovic. Monitoring temporal properties of continuous signals. In International symposium on formal techniques in real-time and fault-tolerant systems, pages 152–166. Springer, 2004

  16. [16] X. Ding, S. L. Smith, C. Belta, and D. Rus. Optimal control of markov decision processes with linear temporal logic constraints. IEEE Transactions on Automatic Control, 59(5):1244–1257, 2014

  17. [18] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta. Q-learning for robust satisfaction of signal temporal logic specifications. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6565–6570. IEEE, 2016

  18. [19] J. Kapinski, X. Jin, J. Deshmukh, A. Donze, T. Yamaguchi, H. Ito, T. Kaga, S. Kobuna, and S. Seshia. St-lib: A library for specifying and classifying model behaviors. Technical report, SAE Technical Paper, 2016

  19. [20] J. V. Deshmukh, A. Donzé, S. Ghosh, X. Jin, G. Juniwal, and S. A. Seshia. Robust online monitoring of signal temporal logic. Formal Methods in System Design, 51(1):5–30, 2017

  20. [21] A. Camacho, R. T. Icarte, T. Q. Klassen, R. A. Valenzano, and S. A. McIlraith. Ltl and beyond: Formal languages for reward function specification in reinforcement learning. In IJCAI, volume 19, pages 6065–6073, 2019

  21. [22] A. Balakrishnan and J. V. Deshmukh. Structured reward shaping using signal temporal logic specifications. In 2019 IEEE/RSJ IROS, pages 3481–3486, 2019

  22. [23] L. Z. Yuan, M. Hasanbeig, A. Abate, and D. Kroening. Modular deep reinforcement learning with temporal logic specifications. arXiv preprint arXiv:1909.11591, 2019

  23. [24] Y. Jiang, S. Bharadwaj, B. Wu, R. Shah, U. Topcu, and P. Stone. Temporal-logic-based reward shaping for continuing reinforcement learning tasks. In Proceedings of the AAAI Conference on artificial Intelligence, volume 35, pages 7995–8003, 2021

  24. [25] M. Hasanbeig, D. Kroening, and A. Abate. Deep reinforcement learning with temporal logics. In International Conference on Formal Modeling and Analysis of Timed Systems, pages 1–22. Springer, 2020

  25. [26] M. Wen, R. Ehlers, and U. Topcu. Correct-by-synthesis reinforcement learning with temporal logic constraints. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4983–4990. IEEE, 2015

  26. [27] R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. Reward machines: Exploiting reward function structure in reinforcement learning. Journal of Artificial Intelligence Research, 73:173–208, 2022

  27. [28] P. Dayan and B. W. Balleine. Reward, motivation, and reinforcement learning. Neuron, 36(2):285–298, 2002

  28. [29] J. Eschmann. Reward function design in reinforcement learning. Reinforcement learning algorithms: Analysis and Applications, pages 25–33, 2021

  29. [30] J. Hare. Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281, 2019

  30. [31] G. Kim, Y.-H. Lee, and H.-W. Park. A learning framework for diverse legged robot locomotion using barrier-based style rewards. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 10004–10010. IEEE, 2025

  31. [32] J. Fu, K. Luo, and S. Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017

  32. [33] A. Y. Ng, S. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000

  33. [34] S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500, 2021

  34. [35] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  35. [36] D. Youm, H. Jung, H. Kim, J. Hwangbo, H.-W. Park, and S. Ha. Imitating and finetuning model predictive control for robust and symmetric quadrupedal locomotion. IEEE Robotics and Automation Letters, 8(11):7799–7806, 2023

  36. [37] T. Li, J. Won, J. Cho, S. Ha, and A. Rai. Fastmimic: Model-based motion imitation for agile, diverse and generalizable quadrupedal locomotion. Robotics, 12(3):90, 2023

  37. [38] H.-C. Liao. A survey of reinforcement learning with temporal logic rewards. 2020

  38. [39] X. Li, C.-I. Vasile, and C. Belta. Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3834–

  39. [40] P. Kapoor, A. Balakrishnan, and J. V. Deshmukh. Model-based reinforcement learning from signal temporal logic specifications. arXiv preprint arXiv:2011.04950, 2020

  40. [41] A. Puranic, J. Deshmukh, and S. Nikolaidis. Learning from demonstrations using signal temporal logic. In Conference on Robot Learning, pages 2228–2242. PMLR, 2021

  41. [42] S. Feng, X. Xinjilefu, W. Huang, and C. G. Atkeson. 3d walking based on online optimization. In 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 21–27. IEEE, 2013

  42. [43] Y. Zhao, U. Topcu, and L. Sentis. High-level planner synthesis for whole-body locomotion in unstructured environments. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6557–6564. IEEE, 2016

  43. [44] H. Audren, A. Kheddar, and P. Gergondet. Stability polygons reshaping and morphing for smooth multi-contact transitions and force control of humanoid robots. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pages 1037–1044. IEEE, 2016

  44. [45] Z. Gu, R. Guo, W. Yates, Y. Chen, Y. Zhao, and Y. Zhao. Walking-by-logic: Signal temporal logic-guided model predictive control for bipedal locomotion resilient to external perturbations. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 1121–1127. IEEE, 2024

  45. [46] Z. Gu, Y. Zhao, Y. Chen, R. Guo, J. K. Leestma, G. S. Sawicki, and Y. Zhao. Robust-locomotion-by-logic: Perturbation-resilient bipedal locomotion via signal temporal logic guided model predictive control. IEEE Transactions on Robotics, 2025

  46. [47] J. Humphreys and C. Zhou. Learning to adapt through bio-inspired gait strategies for versatile quadruped locomotion. Nature Machine Intelligence, 7(7):1141–1153, 2025

  47. [48] D. DeFazio, Y. Hayamizu, and S. Zhang. Learning quadruped locomotion policies using logical rules. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, pages 142–150, 2024

  48. [49] G. Schöner, W. Y. Jiang, and J. S. Kelso. A synergetic theory of quadrupedal gaits and gait transitions. Journal of theoretical Biology, 142(3):359–391, 1990

  49. [50] S. M. Danner, S. D. Wilshin, N. A. Shevtsova, and I. A. Rybak. Central control of interlimb coordination and speed-dependent gait expression in quadrupeds. The Journal of physiology, 594(23):6947–6967, 2016

  50. [51] L. Righetti and A. J. Ijspeert. Pattern generators with sensory feedback for the control of quadruped locomotion. In 2008 IEEE International Conference on Robotics and Automation, pages 819–824. IEEE, 2008

  51. [52] C. Liu, Y. Chen, J. Zhang, and Q. Chen. Cpg driven locomotion control of quadruped robot. In 2009 IEEE International Conference on Systems, Man and Cybernetics, pages 2368–2373. IEEE, 2009

  52. [53] J. Humphreys, J. Li, Y. Wan, H. Gao, and C. Zhou. Bio-inspired gait transitions for quadruped locomotion. IEEE Robotics and Automation Letters, 8(10):6131–6138, 2023

  53. [54] M. Neunert, F. Farshidian, A. W. Winkler, and J. Buchli. Trajectory optimization through contacts and automatic gait discovery for quadrupeds. IEEE Robotics and Automation Letters, 2(3):1502–1509, 2017

  54. [55] H. Sun, J. Yang, Y. Jia, and C. Wang. Online hierarchical planning for multicontact locomotion control of quadruped robots. IEEE/ASME Transactions on Mechatronics, 30(3):1718–1728, 2024

  55. [56] K. Liu, L. Dong, X. Tan, W. Zhang, and L. Zhu. Optimization-based flocking control and mpc-based gait synchronization control for multiple quadruped robots. IEEE Robotics and Automation Letters, 9(2):1929–1936, 2024

  56. [57] G. Bellegarda, M. Shafiee, and A. Ijspeert. Allgaits: Learning all quadruped gaits and transitions. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15929–15935. IEEE, 2025

  57. [58] Y. H. Lee, D. T. Tran, J.-h. Hyun, L. T. Phan, I. M. Koo, S. U. Yang, and H. R. Choi. A gait transition algorithm based on hybrid walking gait for a quadruped walking robot. Intelligent Service Robotics, 8(4):185–200, 2015

  58. [59] B. Hu, S. Shao, Z. Cao, Q. Xiao, Q. Li, and C. Ma. Learning a faster locomotion gait for a quadruped robot with model-free deep reinforcement learning. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 1097–1102. IEEE, 2019

  59. [60] V. Tsounis, M. Alge, J. Lee, F. Farshidian, and M. Hutter. Deepgait: Planning and control of quadrupedal gaits using deep reinforcement learning. IEEE Robotics and Automation Letters, 5(2):3699–3706, 2020

  60. [61] Y. Shao, Y. Jin, X. Liu, W. He, H. Wang, and W. Yang. Learning free gait transition for quadruped robots via phase-guided controller. IEEE Robotics and Automation Letters, 7(2):1230–1237, 2021

  61. [62] S. Xu, L. Zhu, and C. P. Ho. Learning efficient and robust multi-modal quadruped locomotion: A hierarchical approach. In 2022 international conference on robotics and automation (ICRA), pages 4649–4655. IEEE, 2022

  62. [63] L. Wei, Y. Li, Y. Ai, Y. Wu, H. Xu, W. Wang, and G. Hu. Learning multiple-gait quadrupedal locomotion via hierarchical reinforcement learning. International Journal of Precision Engineering and Manufacturing, 24(9):1599–1613, 2023

  63. [64] Y. Yang, T. Zhang, E. Coumans, J. Tan, and B. Boots. Fast and efficient locomotion via learned gait transitions. In Conference on robot learning, pages 773–783. PMLR, 2022

  64. [65] Y. Kim, B. Son, and D. Lee. Learning multiple gaits of quadruped robot using hierarchical reinforcement learning. arXiv preprint arXiv:2112.04741, 2021

  65. [66] A. L. Mitchell, W. Merkt, A. Papatheodorou, I. Havoutis, and I. Posner. Gaitor: Learning a unified representation across gaits for real-world quadruped locomotion. arXiv preprint arXiv:2405.19452, 2024

  66. [67] M. Shafiee, G. Bellegarda, and A. Ijspeert. Manyquadrupeds: Learning a single locomotion policy for diverse quadruped robots. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 3471–3477. IEEE, 2024

  67. [68] K. Zakka, B. Tabanpour, Q. Liao, M. Haiderbhai, S. Holt, J. Y. Luo, A. Allshire, E. Frey, K. Sreenath, L. A. Kahrs, et al. Mujoco playground. arXiv preprint arXiv:2502.08844, 2025

  68. [69] R. M. Alexander and A. Jayes. A dynamic similarity hypothesis for the gaits of quadrupedal mammals. Journal of zoology, 201(1):135–152, 1983

  69. [70] A. Donzé and O. Maler. Robust satisfaction of temporal logic over real-valued signals. In International conference on formal modeling and analysis of timed systems, pages 92–106. Springer, 2010

  70. [71] G. E. Fainekos and G. J. Pappas. Robustness of temporal logic specifications for continuous-time signals. Theoretical Computer Science, 410(42):4262–4291, 2009

  71. [72] E. Asarin, A. Donzé, O. Maler, and D. Nickovic. Parametric identification of temporal properties. In International Conference on Runtime Verification, pages 147–160. Springer, 2011

  72. [73] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  73. [74] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, volume 3, pages 2619–2624. IEEE, 2004

  74. [75] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015

  75. [76] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016

  76. [77] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax–a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281, 2021

A Appendix

A.1 STL Specifications

We additionally evaluated the two rules described below; however, given the constraints already established in the main text, they d...