arxiv: 2606.12814 · v1 · pith:XMH6452Jnew · submitted 2026-06-11 · 💻 cs.RO · cs.AI

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Pith reviewed 2026-06-27 07:10 UTC · model grok-4.3

keywords: reinforcement learning · humanoid robot · motion tracking · fall recovery · probabilistic termination · adaptive sampling · unified policy · actor-critic

The pith

A single unified RL policy with probabilistic termination and adaptive sampling achieves robust humanoid motion tracking and fall recovery without multi-stage training or separate policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Stubborn, a unified reinforcement learning framework that combines motion tracking and fall recovery in one policy for humanoids. Most prior approaches treat the two as separate tasks and rely on multi-stage training with specialized recovery rewards and/or separate recovery policies; many also terminate episodes immediately after severe tracking failures, which limits exploration of recovery behaviors. Stubborn instead uses a Bernoulli-based probabilistic termination, which lets an episode continue with some probability after a failure, and an adaptive sampling strategy driven by tracking error, which concentrates training on difficult segments and unstable states. Both are paired with a yaw-aligned tracking representation in an asymmetric Actor-Critic architecture. If the mechanisms work as described, a single streamlined training process yields competitive performance on both tasks.

Core claim

Stubborn demonstrates that an asymmetric Actor-Critic policy trained with yaw-aligned tracking representation, Bernoulli-based probabilistic termination, and tracking-error-driven adaptive sampling can achieve competitive motion tracking and fall recovery performance using a single policy, without multi-stage training, specialized recovery rewards, or separate recovery policies. The probabilistic termination encourages exploration of fall-recovery behaviors under varying failure modes, while the adaptive sampling increases training efficiency for difficult motion segments and unstable states.
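To make the termination idea concrete, here is a minimal sketch of a Bernoulli-based soft termination rule. It is not the paper's implementation: the function names, the scalar `tracking_error`, the threshold `error_threshold`, and the fixed probability `p_terminate` are all assumptions chosen for illustration. The paper may condition the probability on other signals or apply the draw differently.

```python
import random


def should_terminate(tracking_error, error_threshold=0.5, p_terminate=0.3, rng=random):
    """Decide whether the episode ends at this step (hypothetical rule).

    Below the threshold the episode never ends early. At or above it,
    the episode ends with probability p_terminate (one Bernoulli draw)
    instead of ending deterministically as a hard cutoff would.
    """
    if tracking_error < error_threshold:
        return False
    return rng.random() < p_terminate


def steps_survived_after_failure(p_terminate, max_steps=1000, seed=0):
    """Count how many consecutive failing steps an episode survives."""
    rng = random.Random(seed)
    steps = 0
    # tracking_error=1.0 is above the 0.5 threshold, so every step is a failing step.
    while steps < max_steps and not should_terminate(1.0, 0.5, p_terminate, rng):
        steps += 1
    return steps
```

Setting `p_terminate=1.0` reproduces the conventional hard cutoff, while smaller values leave the policy in fallen or unstable states for longer, which is where recovery behavior has to be learned.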

What carries the argument

Bernoulli-based probabilistic termination mechanism together with tracking-error-driven adaptive sampling that reshapes the episode distribution to include fallen and unstable states.
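The sampling side can be sketched in the same spirit. The snippet below assumes the reference motion is split into segments, each with a running tracking-error estimate, and turns those errors into start-state sampling probabilities. The softmax weighting, the `temperature`, the `uniform_mix` floor, and the moving-average update are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def sampling_weights(segment_errors, temperature=1.0, uniform_mix=0.1):
    """Map per-segment tracking errors to sampling probabilities.

    A softmax over errors (hypothetical choice) is blended with a uniform
    distribution so that easy segments are still visited occasionally.
    """
    peak = max(segment_errors)  # subtract the max for numerical stability
    exps = [math.exp((e - peak) / temperature) for e in segment_errors]
    total = sum(exps)
    n = len(segment_errors)
    return [(1.0 - uniform_mix) * x / total + uniform_mix / n for x in exps]


def update_error(old_error, new_error, smoothing=0.9):
    """Exponential moving average of a segment's tracking error."""
    return smoothing * old_error + (1.0 - smoothing) * new_error


def sample_segment(weights, rng=random):
    """Draw the index of the segment the next episode starts from."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

With errors such as `[0.1, 0.5, 2.0]`, the third segment receives the largest share of episode starts, and the uniform floor keeps every segment's probability at or above `uniform_mix / n`.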

If this is right

  • Episodes continue with positive probability after severe tracking failures, enabling recovery-oriented exploration in fallen states.
  • The sampling distribution automatically increases exposure to motion segments with high tracking error and to unstable states.
  • Competitive tracking and recovery performance is reached without designing separate recovery policies or recovery-specific reward terms.
  • The same trained policy supports both nominal tracking and recovery from disturbances in simulation and real-world tests.
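The first consequence above can be quantified under a simple assumption that is not taken from the paper: suppose that at every step spent in a failed state, termination is decided by an independent Bernoulli draw with probability p. The number K of failing steps an episode survives is then geometrically distributed.

```latex
% Assumption (not from the paper): one independent Bernoulli(p) termination
% draw per step spent in a failed state, with 0 < p <= 1.
\[
  \Pr(K = k) = (1 - p)^{k}\, p, \qquad k = 0, 1, 2, \dots
\]
\[
  \mathbb{E}[K] = \sum_{k=0}^{\infty} k\,(1 - p)^{k}\, p = \frac{1 - p}{p}.
\]
% Example: p = 0.1 gives E[K] = 9 extra steps; p = 1 (hard termination) gives E[K] = 0.
```

Under this reading, lowering p trades shorter, cleaner episodes for more time spent in the states where recovery must be learned.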

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same termination and sampling ideas could be applied to other long-horizon robotic control problems that involve rare failure states.
  • Hard episode cutoffs may generally reduce sample efficiency in balance-critical tasks; probabilistic continuation offers a lightweight alternative.
  • Yaw alignment as a state representation might transfer to other heading-sensitive locomotion controllers.

Load-bearing premise

That a single unified policy with Bernoulli-based probabilistic termination and tracking-error-driven adaptive sampling can achieve robust fall recovery without multi-stage training, specialized recovery rewards, or separate recovery policies.

What would settle it

An ablation experiment in which removing the probabilistic termination or the adaptive sampling causes the single policy to fail at fall recovery on the same disturbance suite where the full Stubborn policy succeeds.

Figures

Figures reproduced from arXiv: 2606.12814 by He Kong, Xiao Ren, Yuhui Yang, Zhijie Liu, Zongbiao Weng.

Figure 1: Overview of Stubborn. The policy is trained with a yaw-aligned representation and an asymmetric actor-critic architecture, with a Bernoulli-based soft termination mechanism and a probabilistic-termination and tracking-error-driven sampling strategy.
Figure 2: Ablation results of the PT mechanism. The recovery success rate and the average number …
Figure 3: Ablation results of the AdpS sampling strategy. The results suggest that AdpS improves …
Figure 4: Experimental results. Stubborn can recover from falls and precisely tracks diverse dynamic …
Original abstract

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
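The yaw-aligned representation mentioned in the abstract can be illustrated with a small quaternion sketch. It assumes unit quaternions in (w, x, y, z) order that map body-frame vectors to the world frame, a z-up world, and ZYX (yaw-pitch-roll) yaw extraction. All function names are hypothetical, and the paper's exact representation may differ. The point shown is that removing yaw discards heading while leaving the body-frame gravity direction, and hence tilt information, unchanged.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_conj(q):
    """Conjugate, which is the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def yaw_of(q):
    """Yaw angle under the ZYX Euler convention."""
    w, x, y, z = q
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


def yaw_quat(yaw):
    """Pure rotation about the world z-axis."""
    return (math.cos(yaw / 2.0), 0.0, 0.0, math.sin(yaw / 2.0))


def remove_yaw(q):
    """Strip the heading from q, keeping only roll and pitch."""
    return quat_mul(quat_conj(yaw_quat(yaw_of(q))), q)


def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    w, x, y, z = quat_mul(quat_mul(q, (0.0, *v)), quat_conj(q))
    return (x, y, z)


def projected_gravity(q):
    """World gravity direction expressed in the body frame."""
    return rotate(quat_conj(q), (0.0, 0.0, -1.0))


def to_heading_frame(q, world_vec):
    """Express a world-frame vector in the robot's yaw-aligned frame."""
    return rotate(quat_conj(yaw_quat(yaw_of(q))), world_vec)
```

Expressing tracking targets with `to_heading_frame` makes them independent of the robot's absolute heading, which is one plausible way to reduce sensitivity to heading disturbances.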

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes Stubborn, a unified RL framework for humanoid motion tracking and fall recovery that avoids multi-stage training or separate recovery policies. It employs an asymmetric Actor-Critic architecture with three components: a yaw-aligned tracking representation to reduce sensitivity to global drift, a Bernoulli-based probabilistic termination mechanism to encourage exploration of recovery behaviors in failure modes, and a tracking-error-driven adaptive sampling strategy to reshape the training distribution toward difficult segments. The central claim is that this single-policy approach achieves competitive performance against SOTA methods, with ablations confirming the contribution of the probabilistic termination and adaptive sampling, supported by simulation experiments and real-world demonstrations.

Significance. If the empirical claims hold with quantitative backing, the work offers a meaningful simplification to RL pipelines for robust humanoid control by unifying tracking and recovery tasks. The probabilistic termination and adaptive sampling mechanisms directly target exploration limitations in unstable states, which could reduce reliance on hand-crafted recovery rewards or staged curricula common in the field. Strengths include the explicit design choices for yaw alignment and failure-mode exploration, which are falsifiable via the described ablations.

major comments (1)
  1. [Abstract] The claim that 'Stubborn achieved competitive performance' and that 'the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains' is presented without any numerical metrics, specific baselines, ablation tables, or quantitative results. This absence makes it impossible to evaluate the magnitude of improvement or the load-bearing support for the unified-policy claim over multi-stage alternatives.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on the abstract. We address the concern below and will revise the manuscript accordingly.

  1. Referee: [Abstract] The claim that 'Stubborn achieved competitive performance' and that 'the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains' is presented without any numerical metrics, specific baselines, ablation tables, or quantitative results. This absence makes it impossible to evaluate the magnitude of improvement or the load-bearing support for the unified-policy claim over multi-stage alternatives.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative highlights. In the revised manuscript we will add specific metrics (e.g., average tracking error reductions versus the strongest baselines, success rates on fall-recovery tasks, and key ablation deltas) while preserving conciseness. The body of the paper already contains the full tables and statistical details; the abstract revision will simply surface the most load-bearing numbers to support the unified-policy claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes an RL framework with three explicit components (yaw-aligned representation, Bernoulli probabilistic termination, tracking-error-driven adaptive sampling) inside an asymmetric Actor-Critic architecture. All performance claims are presented as outcomes of training and evaluation against external SOTA baselines plus ablations; no equations, fitted parameters, or uniqueness theorems are shown that reduce the claimed gains back to the inputs by construction. No self-citations are invoked as load-bearing justification for the core mechanisms. The derivation chain is therefore a standard empirical pipeline from method design to benchmark results and is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on explicit free parameters, axioms, or invented entities beyond standard RL concepts such as actor-critic architectures.

pith-pipeline@v0.9.1-grok · 5787 in / 1071 out tokens · 34483 ms · 2026-06-27T07:10:16.528339+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 21 canonical work pages · 3 internal anchors

  [1] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021.
  [2] Z. Luo, J. Cao, A. Winkler, K. Kitani, and W. Xu. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023.
  [3] Z. Sun, Y. Peng, Y. Meng, X. Li, B.-S. Huang, Z. Bing, X. Wang, and A. Knoll. Robotdancing: Residual-action reinforcement learning enables robust long-horizon humanoid motion tracking. arXiv:2509.20717, 2025.
  [4] W. Xie, J. Han, J. Zheng, H. Li, X. Liu, J. Shi, W. Zhang, C. Bai, and X. Li. Kungfubot: Physics-based humanoid whole-body control for learning highly-dynamic skills. arXiv:2506.12851, 2025.
  [5] J. Han, W. Xie, J. Zheng, J. Shi, W. Zhang, T. Xiao, and C. Bai. Kungfubot2: Learning versatile motion skills for humanoid whole-body control. arXiv:2509.16638, 2025.
  [6] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG), 37(4):1–14, 2018.
  [7] T. Zhu, G. Cai, Z. Yang, G. Ren, H. Xie, Z. Wang, J. Wu, J. Wang, X. Yang, Y. Mu, and Y. Yan. Clot: Closed-loop global motion tracking for whole-body humanoid teleoperation. arXiv:2602.15060, 2026.
  [8] J. Cheng, D. Kang, G. Fadini, G. Shi, and S. Coros. Rambo: Rl-augmented model-based whole-body control for loco-manipulation. IEEE Robotics and Automation Letters, 2025.
  [9] H. J. Lee, S. H. Jeon, and S. Kim. Learning humanoid arm motion via centroidal momentum regularized multi-agent reinforcement learning. IEEE Robotics and Automation Letters, 2025.
  [10] K. Yin, W. Zeng, K. Fan, M. Dai, Z. Wang, Q. Zhang, Z. Tian, J. Wang, J. Pang, and W. Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots. arXiv:2507.07356, 2025.
  [11] F. Wu, X. Nal, J. Jang, W. Zhu, Z. Gu, A. Wu, and Y. Zhao. Learn to teach: Sample-efficient privileged learning for humanoid locomotion over real-world uneven terrain. IEEE Robotics and Automation Letters, 2025.
  [12] H. Jung, Z. Gu, Y. Zhao, H.-W. Park, and S. Ha. Ppf: Pre-training and preservative fine-tuning of humanoid locomotion via model-assumption-based regularization. IEEE Robotics and Automation Letters, 2025.
  [13] Q. Liao, T. E. Truong, X. Huang, Y. Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv:2508.08241, 2025.
  [14] Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv:2506.14770, 2025.
  [15] A. Duburcq, F. Schramm, G. Boéris, N. Bredeche, and Y. Chevaleyre. Reactive stepping for humanoid robots using reinforcement learning: Application to standing push recovery on the exoskeleton atalante. In 2022 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 9302–9309. IEEE, 2022.
  [16] L. Yang, B. Werner, A. B. Ghansah, and A. D. Ames. Bracing for impact: Robust humanoid push recovery and locomotion with reduced order models. In 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids), pages 728–735. IEEE, 2025.
  [17] I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024.
  [18] M. Chen, K. Wang, B. Zhang, Y. Ren, Z. Zhu, X. Ma, Q. Huang, Z. Yang, Y. Wang, and Z. Su. Holomotion: A foundation model for whole-body humanoid control, 2026. URL https://github.com/HorizonRobotics/HoloMotion
  [19] Z. Zhang, J. Guo, C. Chen, J. Wang, C. Lin, Y. Lian, H. Xue, Z. Wang, M. Liu, J. Lyu, et al. Track any motions under any disturbances. arXiv:2509.13833, 2025.
  [20] Y. Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. arXiv:2511.04131, 2025.
  [21] Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv:2406.10454, 2024.
  [22] T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv:2406.08858, 2024.
  [23] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv:2502.01143, 2025.
  [24] F. Liu, Z. Gu, Y. Cai, Z. Zhou, H. Jung, J. Jang, S. Zhao, S. Ha, Y. Chen, D. Xu, et al. Opt2skill: Imitating dynamically-feasible whole-body trajectories for versatile humanoid loco-manipulation. IEEE Robotics and Automation Letters, 2025.
  [25] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots. arXiv:2402.16796, 2024.
  [26] M. Ji, X. Peng, F. Liu, J. Li, G. Yang, X. Cheng, and X. Wang. Exbody2: Advanced expressive humanoid whole-body control. arXiv:2412.13196, 2024.
  [27] Y. Ze, Z. Chen, J. P. Araújo, Z.-a. Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system. arXiv:2505.02833, 2025.
  [28] Y. Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv:2511.02832, 2025.
  [29] Y. Li, Y. Lin, J. Cui, T. Liu, W. Liang, Y. Zhu, and S. Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In 9th Annual Conference on Robot Learning, 2025.
  [30] Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv:2511.07820, 2025.
  [31] P. Chen, Y. Wang, C. Luo, W. Cai, and M. Zhao. Hifar: Multi-stage curriculum learning for high-dynamics humanoid fall recovery. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2908–2915. IEEE, 2025.
  [32] C. Gaspard, M. Duclusaud, G. Passault, M. Daniel, and O. Ly. Frasa: An end-to-end reinforcement learning agent for fall recovery and stand up of humanoid robots. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15994–16000. IEEE, 2025.
  [33] T. Egle, Y. Yan, D. Lee, and C. Ott. Enhancing model-based step adaptation for push recovery through reinforcement learning of step timing and region. In 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), pages 165–172. IEEE, 2024.
  [34] T. Huang, J. Ren, H. Wang, Z. Wang, Q. Ben, M. Wen, X. Chen, J. Li, and J. Pang. Learning humanoid standing-up control across diverse postures. arXiv:2502.08378, 2025.
  [35] X. He, R. Dong, Z. Chen, and S. Gupta. Learning getting-up policies for real-world humanoid robots. arXiv:2502.12152, 2025.
  [36] T. Zhang, B. Zheng, R. Nai, Y. Hu, Y.-J. Wang, G. Chen, F. Lin, J. Li, C. Hong, K. Sreenath, et al. Hub: Learning extreme humanoid balance. arXiv:2505.07294, 2025.
  [37] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020.
  [38] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.
  [39] Y. Wang, S. Zhu, P. Zhi, Y. Li, J. Li, Y.-L. Li, Y. Xiao, X. Wang, B. Jia, and S. Huang. Omnixtreme: Breaking the generality barrier in high-dynamic humanoid control. arXiv:2602.23843, 2026.

6 Appendix

6.1 Drift-Invariant Yaw-Aligned Tracking Representation

To decouple global drift from the desired motion style, similar to [8, 9], our framework adopt...