Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

Liu Zhao; Peng Lu; Wanyue Li; Yichao Zhong; Yidan Lu

arxiv: 2605.18611 · v1 · pith:NZACOY4Dnew · submitted 2026-05-18 · 💻 cs.RO

Yidan Lu, Yichao Zhong, Liu Zhao, Wanyue Li, Peng Lu

Pith reviewed 2026-05-20 09:00 UTC · model grok-4.3

classification 💻 cs.RO

keywords reinforcement learning · adversarial motion priors · humanoid locomotion · fall recovery · unified policy · state-dependent gate · robot control

The pith

A single policy enables a humanoid robot to walk, run, and recover from falls without any mode switching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reinforcement learning method that trains one controller to handle walking, running, and fall recovery together. It modifies Adversarial Motion Priors by inserting a state-dependent gate that sends each training step to either a recovery discriminator or a locomotion discriminator. The gate activates the recovery path when the robot tilts beyond a fixed gravity threshold and otherwise uses a velocity-conditioned discriminator to select walking or running references. A sympathetic reader would care because the resulting policy needs no runtime commands to choose behaviors and can be deployed as a single frozen model on physical hardware.

Core claim

The framework extends Adversarial Motion Priors by replacing the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running. The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately 37 degrees from vertical; otherwise the locomotion discriminator is used, with the normalized commanded velocity serving as a condition that selects the appropriate reference trajectory between walk and run clips. Only a 3

What carries the argument

A state-dependent gate on projected gravity that routes training transitions to either a recovery discriminator or a velocity-conditioned locomotion discriminator.
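The routing rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the threshold 0.6 on |g_z + 1| comes from the abstract, while the function names, the `v_max` scale used to normalize the commanded velocity, and the example batch are hypothetical.

```python
from typing import Dict, List, Sequence

RECOVERY_THRESHOLD = 0.6  # value quoted in the abstract for |g_z + 1|


def route(g_z: float, threshold: float = RECOVERY_THRESHOLD) -> str:
    """Return the discriminator a transition would be sent to."""
    return "recovery" if abs(g_z + 1.0) > threshold else "locomotion"


def velocity_condition(v_cmd: float, v_max: float) -> float:
    """Normalize commanded speed to [0, 1]; v_max is a hypothetical scale."""
    return min(max(v_cmd / v_max, 0.0), 1.0)


def split_batch(g_z_values: Sequence[float]) -> Dict[str, List[int]]:
    """Group transition indices by the discriminator that would score them."""
    buckets: Dict[str, List[int]] = {"recovery": [], "locomotion": []}
    for i, g_z in enumerate(g_z_values):
        buckets[route(g_z)].append(i)
    return buckets


# Two near-upright transitions and two strongly tilted ones.
batch = split_batch([-1.0, -0.9, -0.2, 0.3])
print(batch)
```

In this sketch the gate acts only on training transitions; nothing equivalent runs at deployment, which matches the claim that the frozen policy carries no runtime mode logic.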

If this is right

A single frozen policy executes all three behaviors at a fixed rate with no runtime mode logic.
Smooth transitions between walking and running occur under the same controller.
Recovery works from both prone and supine starting positions on hardware.
The full behavior set is regularized using only three reference motion clips.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gating idea could be extended to additional state signals to unify still more behaviors in one policy.
Similar orientation-based routing might transfer to other robot types that change modes according to body pose.
Varying the exact threshold value during training could be tested to measure its effect on transition smoothness.

Load-bearing premise

A fixed gravity threshold can reliably separate recovery states from locomotion states in both training and deployment without causing instability or needing extra boundary logic.

What would settle it

Run the deployed policy through motions that keep the projected gravity vector near the threshold value and check whether behavior switches occur correctly without instability or incorrect activation.
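A rough offline version of this check can be run before touching hardware. The sketch below holds g_z constant, adds Gaussian noise, and counts how often the gate output flips. The noise level (0.02), the trace length (500 steps, about 10 s at 50 Hz), and the helper names are assumptions for illustration; only the threshold 0.6 comes from the paper.

```python
import random


def gate(g_z: float, threshold: float = 0.6) -> bool:
    """True when the recovery discriminator would be selected."""
    return abs(g_z + 1.0) > threshold


def count_switches(mean_g_z: float, noise_std: float, steps: int, seed: int = 0) -> int:
    """Count gate flips along a noisy, otherwise constant g_z trace."""
    rng = random.Random(seed)
    previous = gate(mean_g_z)
    switches = 0
    for _ in range(steps):
        current = gate(mean_g_z + rng.gauss(0.0, noise_std))
        switches += current != previous  # bool adds as 0 or 1
        previous = current
    return switches


# g_z = -0.4 sits exactly on the boundary |g_z + 1| = 0.6; g_z = -1.0 is upright.
near_boundary = count_switches(mean_g_z=-0.4, noise_std=0.02, steps=500)
upright = count_switches(mean_g_z=-1.0, noise_std=0.02, steps=500)
print(near_boundary, upright)
```

A large flip count near the boundary only shows that the gate signal itself is noisy there. Whether that translates into unstable behavior of the trained policy still has to be measured on the robot.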

Figures

Figures reproduced from arXiv: 2605.18611 by Liu Zhao, Peng Lu, Wanyue Li, Yichao Zhong, Yidan Lu.

**Figure 2.** Hardware demonstration of the unified policy on Unitree G1.

Original abstract

We propose a unified reinforcement learning framework that enables a single policy to perform walking, running, and fall recovery on the Unitree G1 humanoid robot, validated on physical hardware without any explicit mode-switching command at deployment. The framework extends Adversarial Motion Priors (AMP) by replacing the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running. The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately $37^\circ$ from vertical ($|g_z+1|>0.6$); otherwise the locomotion discriminator is used, with the normalized commanded velocity serving as a condition that selects the appropriate reference trajectory between walk and run clips. Only three LAFAN1 reference clips are required to regularize the complete behavior set. At deployment, a single frozen ONNX policy executes at 50 Hz with no runtime mode logic; hardware experiments demonstrate successful recovery from both prone and supine falls and smooth walk-to-run transitions under the same controller.
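The relation between the 0.6 threshold and the quoted tilt angle depends on how projected gravity is defined, which the abstract does not spell out. The short calculation below is a consistency check under two assumed conventions, not a statement about the authors' code: in the first, g is a unit vector in the body frame with g_z = -cos(tilt); in the second, the threshold is read as a bound on the horizontal component sin(tilt).

```python
import math

THRESHOLD = 0.6  # from the abstract: |g_z + 1| > 0.6

# Assumed convention 1: unit gravity in the body frame, g_z = -cos(tilt).
# Then |g_z + 1| = 1 - cos(tilt), so the gate fires when cos(tilt) < 1 - THRESHOLD.
tilt_from_gz_rule = math.degrees(math.acos(1.0 - THRESHOLD))

# Assumed convention 2: the threshold bounds the horizontal component,
# sqrt(g_x^2 + g_y^2) = sin(tilt).
tilt_from_horizontal_rule = math.degrees(math.asin(THRESHOLD))

print(f"g_z rule:        tilt > {tilt_from_gz_rule:.1f} deg")
print(f"horizontal rule: tilt > {tilt_from_horizontal_rule:.1f} deg")
```

Under the first convention the boundary lands near 66.4 degrees, while the second gives about 36.9 degrees, close to the quoted 37 degrees. Which reading the paper intends would need to be confirmed against the full text.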

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a unified RL framework extending Adversarial Motion Priors (AMP) with a state-dependent gate that routes transitions to either a recovery discriminator or a velocity-conditioned locomotion discriminator based on a fixed projected-gravity threshold (|g_z + 1| > 0.6, corresponding to ~37° tilt). This enables a single policy to produce walking, running, and fall recovery behaviors using only three LAFAN1 reference clips. At deployment, the policy is exported as a frozen ONNX model running at 50 Hz on the Unitree G1 humanoid with no explicit mode-switching logic; hardware experiments claim successful recovery from prone and supine falls and smooth walk-to-run transitions.

Significance. If the central claims hold, the work demonstrates a practical route to multi-behavior unification in humanoid locomotion without runtime mode logic or multiple policies, which would reduce deployment complexity. The use of a minimal set of reference clips and hardware validation on a physical platform from both fall orientations constitutes a concrete empirical contribution to the field.

major comments (1)

[framework section (state-dependent gate)] The unified single-policy claim at deployment depends on the fixed threshold |g_z + 1| > 0.6 producing clean separation between recovery and locomotion states in both training rollouts and real hardware dynamics. No sensitivity analysis, boundary smoothing, or ablation on the threshold value is reported, despite the abstract noting that the threshold is approximate. Projected gravity can fluctuate near the boundary during partial recoveries, under sensor noise, or during high-speed locomotion, risking inconsistent discriminator signals and potential chattering or failed recoveries in the frozen ONNX policy.

minor comments (1)

[abstract] The abstract would be strengthened by including at least one quantitative hardware metric (e.g., success rate, average recovery time, or number of trials) to support the claimed hardware validation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and positive review of our manuscript. We address the single major comment on the state-dependent gate below and outline the revisions we will make.

Point-by-point responses

Referee: [framework section (state-dependent gate)] The unified single-policy claim at deployment depends on the fixed threshold |g_z + 1| > 0.6 producing clean separation between recovery and locomotion states in both training rollouts and real hardware dynamics. No sensitivity analysis, boundary smoothing, or ablation on the threshold value is reported, despite the abstract noting that the threshold is approximate. Projected gravity can fluctuate near the boundary during partial recoveries, under sensor noise, or during high-speed locomotion, risking inconsistent discriminator signals and potential chattering or failed recoveries in the frozen ONNX policy.

Authors: We agree that the lack of sensitivity analysis and boundary discussion is a valid concern that should be addressed to strengthen the robustness claims. The threshold |g_z + 1| > 0.6 was selected empirically to correspond to an approximate 37-degree tilt where recovery behaviors are required, providing reliable state separation in both our training rollouts and hardware tests on the Unitree G1. No chattering or inconsistent discriminator signals were observed in the reported experiments. To directly respond to the referee's point, we will add a new ablation subsection in the revised manuscript that varies the threshold over [0.4, 0.8], reports performance metrics, and analyzes boundary cases including simulated sensor noise and partial-recovery trajectories. We will also evaluate and report on a simple hysteresis band for smoothing if the results indicate improved stability. These additions will appear in the framework and experimental sections.

Revision: yes
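One possible form of the hysteresis band mentioned in the rebuttal is a gate with separate enter and exit thresholds. The sketch below is illustrative only: the class name and the exit threshold of 0.5 are hypothetical, and the paper as summarized here uses a single fixed threshold of 0.6.

```python
class HysteresisGate:
    """Two-threshold gate: enter recovery above `enter`, leave it below `exit_`."""

    def __init__(self, enter: float = 0.6, exit_: float = 0.5) -> None:
        if exit_ > enter:
            raise ValueError("exit_ must not exceed enter")
        self.enter = enter
        self.exit_ = exit_
        self.in_recovery = False

    def update(self, g_z: float) -> bool:
        """Advance the gate by one sample and return True while in recovery."""
        deviation = abs(g_z + 1.0)
        if self.in_recovery:
            if deviation < self.exit_:
                self.in_recovery = False
        elif deviation > self.enter:
            self.in_recovery = True
        return self.in_recovery


gate = HysteresisGate()
trace = [-1.0, -0.38, -0.42, -0.45, -0.55, -0.42]
modes = [gate.update(g_z) for g_z in trace]
print(modes)
```

With a single threshold, the samples at -0.42 and -0.45 would already drop back to locomotion. The band keeps the gate in recovery until the deviation falls below the exit value, at the cost of making the routing depend on history rather than on the current state alone.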

Circularity Check

0 steps flagged

No significant circularity; empirical design with external references and hardware validation

Full rationale

The paper extends AMP with a state-dependent gate using a fixed, author-chosen threshold on projected gravity to route between recovery and locomotion discriminators. This is a methodological design choice, not a fitted parameter or self-referential equation that makes the unified policy output tautological by construction. Training relies on three external LAFAN1 clips and RL optimization, with final claims supported by physical hardware experiments on the Unitree G1 at 50 Hz. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are load-bearing in the derivation. The result is self-contained against external benchmarks rather than reducing to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on a manually chosen gravity threshold and the assumption that three LAFAN1 clips suffice to regularize all required behaviors.

free parameters (1)

gravity threshold = 0.6
Fixed value of 0.6 on |g_z + 1| chosen to activate the recovery discriminator.

axioms (1)

Domain assumption: three LAFAN1 reference clips are representative enough to regularize walking, running, and recovery behaviors.
The abstract states that only these clips are required to cover the complete behavior set.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately 37° from vertical (|g_z + 1| > 0.6); otherwise the locomotion discriminator is used

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Paper passage: A single frozen ONNX policy executes walking, running, and fall recovery at 50 Hz with no runtime mode logic

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1] J. Pratt, J. Carff, S. Drakunov, and A. Goswami, “Capture point: A step toward humanoid push recovery,” in 2006 6th IEEE-RAS International Conference on Humanoid Robots. IEEE, 2006, pp. 200–207.

[2] T. Huang, J. Ren, H. Wang, Z. Wang, Q. Ben, M. Wen, X. Chen, J. Li, and J. Pang, “Learning humanoid standing-up control across diverse postures,” arXiv preprint arXiv:2502.08378, 2025.

[3] X. He, R. Dong, Z. Chen, and S. Gupta, “Learning getting-up policies for real-world humanoid robots,” arXiv preprint arXiv:2502.12152, 2025.

[4] P. Chen, Y. Wang, C. Luo, W. Cai, and M. Zhao, “Hifar: Multi-stage curriculum learning for high-dynamics humanoid fall recovery,” arXiv preprint arXiv:2502.20061, 2025.

[5] C. Gaspard, M. Duclusaud, G. Passault, M. Daniel, and O. Ly, “Frasa: An end-to-end reinforcement learning agent for fall recovery and stand up of humanoid robots,” arXiv preprint arXiv:2410.08655, 2024.

[6] Z. Meng, T. Liu, L. Ma, Y. Wu, R. Song, W. Zhang, and S. Huang, “Safefall: Learning protective control for humanoid robots,” arXiv preprint arXiv:2511.18509, 2025.

[7] Z. Xu, S. Bahl, and D. Pathak, “Unified humanoid fall-safety policy from a few demonstrations,” arXiv preprint arXiv:2511.07407, 2025.

[8] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Transactions on Graphics, vol. 37, no. 4, 2018.

[9] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa, “Amp: Adversarial motion priors for stylized physics-based character control,” ACM Transactions on Graphics, vol. 40, no. 4, 2021.

[10] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal, “Robust motion in-betweening,” ACM Transactions on Graphics, vol. 39, no. 4, pp. 60:1–60:12, 2020.

[11] M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mukopadhyay, G. Bhatt, R. Burgess-Limerick, A. Mandlekar, M. Hutter, and A. Garg, “Orbit: A unified simulation framework for interactive robot learning environments,” IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3740–3747, 2023.

[12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
