arxiv: 2604.19102 · v1 · submitted 2026-04-21 · 💻 cs.RO · cs.AI

Multi-Gait Learning for Humanoid Robots Using Reinforcement Learning with Selective Adversarial Motion Prior

Pith reviewed 2026-05-10 02:53 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords multi-gait learning · reinforcement learning · adversarial motion prior · humanoid locomotion · selective regularization · PPO · sim-to-real transfer · robot gaits

The pith

Selective application of an adversarial motion prior in one reinforcement learning policy lets a humanoid master five gaits with faster convergence and no loss of agility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that a single policy structure, action space, and reward formulation can produce five distinct humanoid gaits when an adversarial motion prior is applied only to the periodic, stability-critical ones. The authors show that deliberately omitting the prior from running and jumping prevents over-constraint while still accelerating learning and reducing error on walking, goose-stepping, and stair climbing. This matters because it offers a concrete way to reconcile the stability-agility trade-off inside one training run rather than training separate controllers. Policies are trained with PPO plus domain randomization in simulation and transferred zero-shot to a physical 12-DOF robot.

Core claim

The selective strategy applies the adversarial motion prior discriminator only to walking, goose-stepping, and stair climbing, and leaves it out for running and jumping. Within an otherwise identical PPO training setup that transfers directly to hardware, it produces faster convergence, lower tracking error, and higher success rates than a uniform AMP baseline across all five gaits.

What carries the argument

The selective Adversarial Motion Prior strategy, which adds discriminator-based style regularization only to periodic stability-critical gaits and omits it from dynamic ones to preserve expressiveness.
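
The gating logic can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the style term is added to the task reward as a weighted sum, and `style_weight`, the function names, and the gait labels are all hypothetical. The style reward uses the least-squares form from the original AMP paper (Peng et al., 2021), which may differ from what this paper uses.

```python
# Gaits that receive the AMP style term, following the paper's partition.
AMP_GAITS = frozenset({"walking", "goose_stepping", "stair_climbing"})
# Gaits trained without the prior.
DYNAMIC_GAITS = frozenset({"running", "jumping"})


def amp_style_reward(disc_score: float) -> float:
    """Style reward in the least-squares AMP form: max(0, 1 - 0.25 * (d - 1)^2)."""
    return max(0.0, 1.0 - 0.25 * (disc_score - 1.0) ** 2)


def total_reward(gait: str, task_reward: float, disc_score: float,
                 style_weight: float = 0.5) -> float:
    """Add the style term only for gaits in AMP_GAITS.

    style_weight is a placeholder value, not one reported by the paper.
    """
    if gait not in AMP_GAITS | DYNAMIC_GAITS:
        raise ValueError(f"unknown gait: {gait}")
    if gait in AMP_GAITS:
        return task_reward + style_weight * amp_style_reward(disc_score)
    # Dynamic gaits: the discriminator output is ignored entirely.
    return task_reward
```

Under this sketch the policy, action space, and task reward are shared across gaits, and the only per-gait difference is whether the discriminator contributes to the reward.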

If this is right

  • Training time decreases and final performance improves on walking, goose-stepping, and stair climbing compared with uniform AMP.
  • Running and jumping retain the same agility and success rates as policies trained without any AMP term.
  • The identical policy architecture and reward terms suffice for all five gaits without per-gait retuning.
  • Zero-shot sim-to-real transfer succeeds for the full set of gaits on a 12-DOF humanoid.
  • Stability-focused gaits show suppressed erratic behavior while dynamic gaits keep required explosiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same selective-regularization logic could be tested on other multi-behavior robotic tasks where some skills benefit from motion-style guidance and others need unrestricted exploration.
  • An automatic classifier that decides on the fly whether to apply the prior might remove the need for hand-labeled gait categories.
  • Combining selective AMP with terrain-aware rewards could extend the approach to outdoor or uneven surfaces without separate policies.
  • Scaling the method to humanoids with higher degrees of freedom would test whether the selective benefit persists when action spaces grow larger.

Load-bearing premise

That manually classifying gaits into stability-critical versus dynamic groups is accurate enough that omitting the prior from the dynamic group will preserve agility without introducing new instabilities.

What would settle it

A uniform AMP policy trained under identical conditions that reaches equal or lower tracking error and equal or higher success rates on running and jumping, or a selective policy that produces visibly erratic or unstable running and jumping motions.
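
The quantitative half of this criterion can be written as a simple check. The sketch below assumes per-gait tracking error and success rate are available for both policies and that "equal or lower" and "equal or higher" are measured against the selective policy. The type and function names are hypothetical, and the qualitative half (visibly erratic motion) is not captured.

```python
from typing import Dict, NamedTuple


class GaitMetrics(NamedTuple):
    tracking_error: float  # lower is better
    success_rate: float    # higher is better


# The two gaits for which the paper omits the prior.
DYNAMIC_GAITS = ("running", "jumping")


def uniform_matches_or_beats(selective: Dict[str, GaitMetrics],
                             uniform: Dict[str, GaitMetrics]) -> bool:
    """True if uniform AMP is at least as good as selective AMP on every dynamic gait."""
    return all(
        uniform[g].tracking_error <= selective[g].tracking_error
        and uniform[g].success_rate >= selective[g].success_rate
        for g in DYNAMIC_GAITS
    )
```

A `True` result on matched training runs would count against the selective strategy. In practice the comparison would also need multiple seeds and error bars, which this sketch omits.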

Figures

Figures reproduced from arXiv: 2604.19102 by Boyang Xing, Keyi Wang, Linqi Ye, Yuanye Wu.

Figure 1: Overview of the multi-gait learning pipeline. Policies are trained …
Figure 2: Representative real-robot image sequences for the five learned gaits: …
Figure 3: Training curves used in the AMP-versus-no-AMP comparison. (a) Goose-stepping with AMP. (b) Jumping without AMP. Each panel reports total …
Original abstract

Learning diverse locomotion skills for humanoid robots in a unified reinforcement learning framework remains challenging due to the conflicting requirements of stability and dynamic expressiveness across different gaits. We present a multi-gait learning approach that enables a humanoid robot to master five distinct gaits -- walking, goose-stepping, running, stair climbing, and jumping -- using a consistent policy structure, action space, and reward formulation. The key contribution is a selective Adversarial Motion Prior (AMP) strategy: AMP is applied to periodic, stability-critical gaits (walking, goose-stepping, stair climbing) where it accelerates convergence and suppresses erratic behavior, while being deliberately omitted for highly dynamic gaits (running, jumping) where its regularization would over-constrain the motion. Policies are trained via PPO with domain randomization in simulation and deployed on a physical 12-DOF humanoid robot through zero-shot sim-to-real transfer. Quantitative comparisons demonstrate that selective AMP outperforms a uniform AMP policy across all five gaits, achieving faster convergence, lower tracking error, and higher success rates on stability-focused gaits without sacrificing the agility required for dynamic ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a reinforcement learning method for training a unified policy on a 12-DOF humanoid robot to perform five gaits: walking, goose-stepping, running, stair climbing, and jumping. The core innovation is a selective application of Adversarial Motion Prior (AMP), applied only to periodic stability-critical gaits (walking, goose-stepping, stair climbing) to accelerate convergence and reduce erratic behavior, while omitted for dynamic gaits (running, jumping) to avoid over-constraining agility. Training uses PPO with domain randomization in simulation, followed by zero-shot sim-to-real transfer. The authors claim that this selective AMP outperforms a uniform AMP baseline across all gaits in terms of convergence speed, tracking error, and success rates.

Significance. If the empirical results hold with proper ablations and the selective strategy is robustly justified, this work could contribute to scalable multi-skill locomotion by balancing motion prior regularization with the need for dynamic expressiveness in RL policies for humanoids. The zero-shot hardware deployment adds practical value.

major comments (1)
  1. [Method section describing selective AMP and gait classification] The central claim that selective AMP preserves agility for dynamic gaits without sacrificing performance depends on the gait classification that treats running as dynamic (AMP omitted) despite its periodic nature, similar to the stability-critical gaits. No analysis, ablation, or discussion addresses the effect of applying AMP to running reference motions, alternative partitioning, or the sensitivity of results to the selection criteria. This is load-bearing for interpreting the outperformance and the 'without sacrificing agility' assertion.
minor comments (1)
  1. [Abstract] The abstract asserts quantitative outperformance (faster convergence, lower tracking error, higher success rates) but supplies no specific metrics, error bars, or references to tables/figures. These details should be added or explicitly linked for immediate evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our selective AMP strategy. The major comment raises a valid point about the gait classification and lack of explicit analysis for running. We address this directly below, agreeing to strengthen the manuscript with additional discussion while defending the current approach on the basis of the existing uniform AMP baseline.

Point-by-point responses
  1. Referee: The central claim that selective AMP preserves agility for dynamic gaits without sacrificing performance depends on the gait classification that treats running as dynamic (AMP omitted) despite its periodic nature, similar to the stability-critical gaits. No analysis, ablation, or discussion addresses the effect of applying AMP to running reference motions, alternative partitioning, or the sensitivity of results to the selection criteria. This is load-bearing for interpreting the outperformance and the 'without sacrificing agility' assertion.

    Authors: We acknowledge that running exhibits periodicity. However, our classification prioritizes empirical training dynamics: AMP's regularization on running reference motions constrains the policy's capacity to exceed reference velocities and adapt foot placements under high-speed conditions, leading to reduced agility. The uniform AMP baseline already functions as the relevant ablation, as it applies the prior to running (and all other gaits) and yields measurably worse convergence, tracking error, and success rates on dynamic tasks compared with the selective variant. We will add a new paragraph in the Method section explicitly justifying the stability-critical versus dynamic partitioning with reference to observed policy behavior during training. We will also note the absence of exhaustive alternative partitioning experiments and sensitivity sweeps as a limitation, while arguing that the current criteria are robustly supported by the performance gap versus the uniform baseline.

    revision: partial
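
For scale, the alternative partitionings at issue in this exchange can be enumerated directly. The sketch below is illustrative only: the gait labels and function names are hypothetical, and it says nothing about which partitions the authors actually ran.

```python
from itertools import combinations

GAITS = ("walking", "goose_stepping", "running", "stair_climbing", "jumping")
# The partition used in the paper: these gaits receive the AMP term.
PAPER_AMP_SET = frozenset({"walking", "goose_stepping", "stair_climbing"})


def amp_partitions(gaits=GAITS):
    """Yield every subset of gaits that could receive the AMP term (2^n subsets)."""
    for k in range(len(gaits) + 1):
        for subset in combinations(gaits, k):
            yield frozenset(subset)


def neighbours_of_paper_partition():
    """Partitions that differ from the paper's choice by moving exactly one gait."""
    return [s for s in amp_partitions() if len(s ^ PAPER_AMP_SET) == 1]
```

With five gaits there are 32 possible partitions. The uniform baseline (all five gaits with AMP) is one of them. The five single-gait moves, including "AMP on running only in addition to the paper's set", would be the cheapest sensitivity sweep and would isolate the effect the referee asks about.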

Circularity Check

0 steps flagged

No circularity: empirical RL comparisons with design choice for selective AMP

The paper is an empirical RL study using PPO with domain randomization for training a unified policy on five gaits. The selective AMP strategy is introduced as an explicit design decision (apply to stability-critical periodic gaits, omit for dynamic ones) whose value is shown via direct quantitative comparisons of convergence, tracking error, and success rates against a uniform AMP baseline. No derivations, first-principles predictions, or fitted parameters are claimed; the central result is the observed outperformance of the selective variant. AMP itself is referenced as an established prior technique rather than a self-derived result. The gait partitioning is a modeling assumption tested by the experiments, not a self-definitional or self-citation load-bearing step that reduces the outcome to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard RL training assumptions and tuned reward components whose specifics are not provided. No new physical entities are postulated.

free parameters (2)
  • reward weights and formulation coefficients
    The consistent reward formulation across gaits requires multiple tuned parameters to balance stability, tracking, and expressiveness; values are not reported.
  • AMP selection criteria or thresholds
    Rules for applying AMP selectively to periodic gaits only are chosen but not quantified in the abstract.
axioms (2)
  • domain assumption: PPO with domain randomization produces policies that transfer zero-shot to physical robots.
    Invoked implicitly for the sim-to-real deployment on the 12-DOF humanoid.
  • domain assumption: The five gaits divide cleanly into stability-critical periodic versus highly dynamic categories that benefit from differential regularization.
    Underpins the selective AMP design choice.
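
To make the first axiom concrete, per-episode domain randomization is commonly implemented by resampling physical parameters at each reset. The sketch below is a generic illustration of that idea. The parameter names and ranges are hypothetical and are not taken from the paper, which does not report them in the abstract.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeDynamics:
    friction: float
    added_base_mass_kg: float
    motor_strength_scale: float


# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "friction": (0.4, 1.2),
    "added_base_mass_kg": (-1.0, 2.0),
    "motor_strength_scale": (0.9, 1.1),
}


def sample_dynamics(rng: random.Random) -> EpisodeDynamics:
    """Draw one set of simulator parameters uniformly from RANGES."""
    return EpisodeDynamics(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    )
```

The axiom is that a policy trained across such a distribution treats the physical robot as one more sample from it. Whether that holds depends on how well the chosen ranges cover the real hardware, which is itself a set of free parameters.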

pith-pipeline@v0.9.0 · 5499 in / 1595 out tokens · 54320 ms · 2026-05-10T02:53:08.361635+00:00 · methodology
