pith. machine review for the scientific record.

arxiv: 2511.06371 · v3 · submitted 2025-11-09 · 💻 cs.RO

Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning

Pith reviewed 2026-05-17 23:46 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid locomotion · multi-behavior distillation · reinforced fine-tuning · adaptive control · terrain adaptation · locomotion skills · Unitree G1 · policy distillation

The pith

A two-stage process of distilling multiple locomotion policies and then fine-tuning with online feedback produces a single adaptive controller for humanoid robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that training a separate policy for each skill (standing up, walking, running, jumping) yields controllers that break on irregular terrain, so a unified approach is needed. It first distills several primary policies into one multi-behavior controller that switches behaviors according to the current environment. It then collects online feedback on more varied terrains and uses reinforced fine-tuning to improve generalization. A sympathetic reader would care because this points toward humanoid robots that can be deployed in unstructured real-world settings without maintaining a library of brittle, skill-specific controllers.

Core claim

The paper claims that first training primary locomotion policies and distilling them into a basic multi-behavior controller, then performing reinforced fine-tuning by collecting online feedback on diverse terrains, produces an adaptive humanoid locomotion controller that exhibits strong adaptability across various situations and terrains, as shown in both simulation and real-world experiments on Unitree G1 robots.

What carries the argument

The two-stage Adaptive Humanoid Control (AHC) framework: multi-behavior distillation creates a basic controller capable of environment-driven behavior switching, and reinforced fine-tuning then incorporates online feedback to improve terrain adaptability.
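
As a rough illustration of the first stage, the sketch below distills two teacher policies into one student by regressing the student's action onto whichever teacher matches the current terrain. Everything in it is an assumption made for illustration: the teacher functions, the terrain labels, the scalar action, and the linear student are hypothetical stand-ins, and the text reviewed here does not specify the paper's actual distillation loss or network.

```python
import random

# Hypothetical teacher policies: each maps an observation to a scalar action.
def walk_teacher(obs):
    return 0.5 * obs[0]

def jump_teacher(obs):
    return 2.0 * obs[0] + 1.0

# Assumed mapping from terrain label to the teacher that handles it.
TEACHERS = {"flat": walk_teacher, "gap": jump_teacher}

def featurize(obs, terrain):
    """Combine the observation with a one-hot terrain cue."""
    flat = 1.0 if terrain == "flat" else 0.0
    gap = 1.0 - flat
    return [obs[0] * flat, obs[0] * gap, flat, gap]

def student(weights, obs, terrain):
    """Linear student policy conditioned on the terrain cue."""
    return sum(w * x for w, x in zip(weights, featurize(obs, terrain)))

def distill(steps=5000, lr=0.05, seed=0):
    """Fit the student to the terrain-appropriate teacher with SGD on squared error."""
    rng = random.Random(seed)
    weights = [0.0] * 4
    for _ in range(steps):
        terrain = rng.choice(list(TEACHERS))
        obs = [rng.uniform(-1.0, 1.0)]
        target = TEACHERS[terrain](obs)
        features = featurize(obs, terrain)
        error = student(weights, obs, terrain) - target
        weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights
```

After `weights = distill()`, calling `student(weights, [0.4], "flat")` and `student(weights, [0.4], "gap")` should track the walking and jumping teachers respectively, which is the environment-driven switching the framework relies on. The sketch samples observations independently rather than from student rollouts, so it is plain supervised distillation, not an on-policy scheme.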

If this is right

  • The distilled controller enables the robot to switch between standing up, walking, running, and jumping based on environmental cues.
  • Reinforced fine-tuning with online feedback improves performance on irregular terrains beyond what independently trained policies achieve.
  • The resulting controller demonstrates strong adaptability in both simulation and physical experiments on the Unitree G1 robot.
  • The approach reduces the need to maintain multiple separate behavior-specific controllers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation-plus-fine-tuning pattern could be tested on other legged platforms to see whether multi-skill adaptation transfers beyond humanoids.
  • Adding richer sensory streams such as vision during the fine-tuning stage might further close the gap between simulated and real terrain performance.
  • The online feedback loop suggests a route toward continual adaptation after initial deployment rather than one-time training.

Load-bearing premise

Multi-behavior distillation followed by reinforced fine-tuning on online feedback will automatically produce reliable behavior switching and terrain generalization.

What would settle it

The central claim would be falsified if, in real-world Unitree G1 tests, the controller failed to switch behaviors correctly or showed brittle performance on an irregular terrain never encountered during the fine-tuning stage.
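
One way to operationalize the switching half of this test is a cue-perturbation check: measure how often the chosen behavior matches the terrain, then break the link between cue and terrain and measure again. The sketch below is a minimal version under stated assumptions. The policy, the gap-width cue, the 0.3 threshold, and the episode list are all hypothetical, and reversing the cue order is a deterministic stand-in for random shuffling.

```python
def switching_accuracy(policy, episodes, perturb=False):
    """Fraction of episodes where the chosen behavior matches the expected one.

    policy: callable mapping a cue (float) to a behavior name.
    episodes: list of (cue, expected_behavior) pairs.
    perturb: if True, reassign cues across episodes (reversed order) so the
             cue no longer corresponds to the terrain it came from.
    """
    cues = [cue for cue, _ in episodes]
    if perturb:
        cues = list(reversed(cues))
    hits = sum(
        policy(cue) == expected
        for cue, (_, expected) in zip(cues, episodes)
    )
    return hits / len(episodes)

# Hypothetical stand-in policy: jump when the perceived gap width exceeds 0.3.
def toy_policy(gap_width):
    return "jump" if gap_width > 0.3 else "walk"

# Hypothetical episodes, not data from the paper.
episodes = [(0.0, "walk"), (0.1, "walk"), (0.5, "jump"), (0.6, "jump")]

clean = switching_accuracy(toy_policy, episodes)
perturbed = switching_accuracy(toy_policy, episodes, perturb=True)
```

A controller that really conditions on the cue should lose accuracy when cues are reassigned, as the toy policy does here. A controller whose accuracy is unchanged under perturbation is not using the cue at all.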

Figures

Figures reproduced from arXiv: 2511.06371 by Chenjia Bai, Dan Lu, Dewei Wang, Peng Liu, Qilong Han, Xinmiao Wang, Xinzhe Liu, Yingnan Zhao.

Figure 1: Comparison between multi-task RL and our pro…
Figure 2: Overview of the proposed two-stage framework
Figure 3: Comparison of recovery motions under AHC and…
Figure 5: Value loss curves during the second-stage fine…
Figure 6: Training episode return curves during second…
Figure 7: Snapshot of real-world deployment. The robot performs recovery and locomotion in diverse scenarios, including…
Original abstract

Humanoid robots are promising to learn a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC) that adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback in performing adaptive behaviors on more diverse terrains, enhancing terrain adaptability for the controller. We conduct experiments in both simulation and real-world experiments in Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains. Project website: https://ahc-humanoid.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Adaptive Humanoid Control (AHC), a two-stage framework for humanoid locomotion. Primary policies are trained for individual skills (standing, walking, running, jumping) and distilled into a single multi-behavior controller that switches behaviors according to environmental cues. This controller is then refined via reinforced fine-tuning that collects online feedback on diverse terrains to improve robustness. Experiments are reported in simulation and on the Unitree G1 hardware, with the central claim that the resulting controller exhibits strong adaptability across situations and terrains.

Significance. If the empirical claims are substantiated, the work would offer a practical route to unified controllers that avoid the brittleness of per-skill policies, potentially simplifying deployment of humanoids on irregular terrain. The two-stage distillation-plus-RL structure is a clear engineering contribution that could be adopted by other locomotion systems.

major comments (3)
  1. [§4 and abstract] §4 (Experiments) and abstract: the central claim of 'strong adaptability across various situations and terrains' is presented without quantitative metrics (success rates, traversal distances, energy consumption), baseline comparisons (independent per-skill policies, single-policy RL, or prior distillation methods), error bars, or an explicit definition of how adaptability was measured. This absence makes the claim unverifiable from the reported evidence.
  2. [§3.2] §3.2 (Reinforced Fine-Tuning): the reward function, terrain sampling distribution, and online feedback mechanism are not specified. Because the claim that fine-tuning improves terrain generalization beyond the distilled policy rests on these choices, their omission is load-bearing; without them it is impossible to rule out that observed gains arise from favorable shaping or narrow terrain coverage rather than the proposed mechanism.
  3. [§3.1] §3.1 (Multi-Behavior Distillation): no analysis or ablation is provided to show that behavior switching occurs on the basis of environment cues rather than memorization of training conditions. A concrete test (e.g., out-of-distribution terrain or cue perturbation) would be required to support the adaptability assertion.
minor comments (2)
  1. The project website is referenced but the manuscript would benefit from explicit pointers to supplementary videos or code that demonstrate the claimed real-world behavior switching.
  2. [§3] Notation for the distilled policy and the fine-tuned policy should be introduced once and used consistently; currently the distinction is described only in prose.
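
The quantitative reporting requested in major comment 1 could take roughly the following shape: a per-seed success rate for each controller, summarized as a mean with a standard error. This is a sketch under assumptions, not the paper's evaluation protocol. The controller names and all numbers are placeholders chosen for illustration and are not results from the paper.

```python
import math
import statistics

def summarize(success_by_seed):
    """Return (mean, standard error of the mean) of per-seed success rates."""
    mean = statistics.mean(success_by_seed)
    if len(success_by_seed) < 2:
        # A standard error is undefined for a single seed.
        return mean, float("nan")
    sem = statistics.stdev(success_by_seed) / math.sqrt(len(success_by_seed))
    return mean, sem

# Placeholder per-seed success rates (hypothetical, NOT from the paper).
placeholder_results = {
    "controller_a": [0.50, 0.60, 0.40],
    "controller_b": [0.70, 0.80, 0.60],
}

def report(results):
    """Format one 'name: mean +/- sem' line per controller."""
    lines = []
    for name, rates in results.items():
        mean, sem = summarize(rates)
        lines.append(f"{name}: {mean:.2f} +/- {sem:.2f}")
    return lines
```

Reporting the standard error over seeds, rather than a single best run, is what would let a reader judge whether a gap between controllers exceeds seed-to-seed variation.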

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We believe the suggested revisions will significantly strengthen the paper by providing more rigorous evidence for our claims. We address each major comment below.

  1. Referee: [§4 and abstract] §4 (Experiments) and abstract: the central claim of 'strong adaptability across various situations and terrains' is presented without quantitative metrics (success rates, traversal distances, energy consumption), baseline comparisons (independent per-skill policies, single-policy RL, or prior distillation methods), error bars, or an explicit definition of how adaptability was measured. This absence makes the claim unverifiable from the reported evidence.

    Authors: We agree that additional quantitative evidence is necessary to substantiate the central claim. In the revised manuscript, we will expand §4 to include specific metrics such as success rates for skill transitions and terrain traversal, average distances traversed before failure, and energy consumption (e.g., torque norms). We will report these with error bars from multiple random seeds. Baseline comparisons will be added against independent per-skill policies and a monolithic RL policy trained directly on all behaviors. We will also explicitly define adaptability as the controller's ability to seamlessly switch behaviors and maintain stability on terrains with varying irregularities not seen during initial training. These changes will allow readers to verify the claims directly from the data.
    revision: yes

  2. Referee: [§3.2] §3.2 (Reinforced Fine-Tuning): the reward function, terrain sampling distribution, and online feedback mechanism are not specified. Because the claim that fine-tuning improves terrain generalization beyond the distilled policy rests on these choices, their omission is load-bearing; without them it is impossible to rule out that observed gains arise from favorable shaping or narrow terrain coverage rather than the proposed mechanism.

    Authors: The referee correctly identifies that these implementation details are essential. We will revise §3.2 to fully specify the reward function, which includes terms for forward velocity tracking, posture stability, foot clearance, and action smoothness. The terrain sampling distribution will be detailed, including the range of slope angles, step heights, and roughness levels used during fine-tuning. The online feedback mechanism involves periodic evaluation on a held-out set of diverse terrains, with policy updates based on accumulated rewards from these interactions. This will demonstrate that the improvements stem from the reinforced fine-tuning process rather than specific shaping.
    revision: yes

  3. Referee: [§3.1] §3.1 (Multi-Behavior Distillation): no analysis or ablation is provided to show that behavior switching occurs on the basis of environment cues rather than memorization of training conditions. A concrete test (e.g., out-of-distribution terrain or cue perturbation) would be required to support the adaptability assertion.

    Authors: While the distillation process is intended to produce a policy that conditions behavior on current observations including terrain features, we acknowledge the lack of explicit validation for cue-based switching. In the revised version, we will add an analysis in §3.1 or a new subsection, including ablation experiments where we test on out-of-distribution terrains (e.g., unseen obstacle configurations) and with artificially perturbed environmental cues. Performance degradation under cue perturbation would indicate reliance on cues, while robustness would support the adaptability claim. We will also visualize the behavior selection probabilities conditioned on different inputs.
    revision: yes
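
The four reward terms named in response 2 (forward velocity tracking, posture stability, foot clearance, action smoothness) could be combined along the following lines. This is a hedged sketch only: the functional form of each term, the weights, the 0.25 tracking scale, and the argument names are assumptions made for illustration, since neither the abstract nor the simulated rebuttal gives the actual formula.

```python
import math

def locomotion_reward(v_forward, v_target, roll, pitch,
                      foot_clearance, clearance_target,
                      action, prev_action,
                      weights=(1.0, 0.5, 0.25, 0.01)):
    """Weighted sum of four shaping terms (all forms and weights are assumed)."""
    w_vel, w_posture, w_clearance, w_smooth = weights

    # Velocity tracking: 1.0 at perfect tracking, decaying with squared error.
    r_vel = math.exp(-((v_forward - v_target) ** 2) / 0.25)

    # Posture stability: penalize torso roll and pitch (radians).
    r_posture = -(roll ** 2 + pitch ** 2)

    # Foot clearance: penalize deviation from a target swing height.
    r_clearance = -((foot_clearance - clearance_target) ** 2)

    # Action smoothness: penalize change between consecutive actions.
    r_smooth = -sum((a - b) ** 2 for a, b in zip(action, prev_action))

    return (w_vel * r_vel + w_posture * r_posture
            + w_clearance * r_clearance + w_smooth * r_smooth)
```

With this form the reward peaks at `weights[0]` when tracking is exact, the torso is upright, clearance is on target, and the action is unchanged. Any deviation lowers it, which makes the relative weights the main shaping choice a reader would need to see reported.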

Circularity Check

0 steps flagged

No circularity: empirical two-stage method with no self-defining equations or fitted predictions


The paper describes a two-stage empirical framework: training primary locomotion policies, followed by multi-behavior distillation to create an adaptive controller, then reinforced fine-tuning using online feedback on diverse terrains. No equations, parameter fittings, or derivations are presented in the provided text that reduce a claimed result to its own inputs by construction. Adaptability is asserted via simulation and real-world experiments on Unitree G1 robots rather than any self-referential definition or self-citation load-bearing premise. The central claims therefore remain independent of the patterns that would indicate circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes standard RL components such as policy networks and reward signals but does not introduce new ones.

pith-pipeline@v0.9.0 · 5496 in / 1003 out tokens · 23306 ms · 2026-05-17T23:46:59.951603+00:00 · methodology


