Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning
Pith reviewed 2026-05-17 23:46 UTC · model grok-4.3
The pith
A two-stage process of distilling multiple locomotion policies and then fine-tuning with online feedback produces a single adaptive controller for humanoid robots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a two-stage procedure yields a single humanoid locomotion controller that adapts across situations and terrains. First, primary locomotion policies are trained and distilled into a basic multi-behavior controller. Second, that controller undergoes reinforced fine-tuning using online feedback collected on diverse terrains. The claim is supported by both simulation and real-world experiments on Unitree G1 robots.
What carries the argument
The argument rests on the two-stage Adaptive Humanoid Control (AHC) framework: multi-behavior distillation creates a basic controller capable of environment-driven behavior switching, and reinforced fine-tuning then incorporates online feedback to improve terrain adaptability.
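The first stage can be pictured as supervised regression of one student policy onto several teachers. The following is a minimal sketch, not the paper's implementation: it assumes a mean-squared-error action-matching loss and a behavior label per sample, and the names `distillation_loss`, `walk_teacher`, and `stand_teacher` are hypothetical placeholders for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Observation = Sequence[float]
Action = List[float]
Policy = Callable[[Observation], Action]


def distillation_loss(
    student: Policy,
    teachers: Dict[str, Policy],
    batch: Sequence[Tuple[str, Observation]],
) -> float:
    """Mean squared error between student actions and the active teacher's actions.

    Each batch item pairs a behavior label with an observation; the label picks
    which teacher supplies the target action (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for behavior, obs in batch:
        target = teachers[behavior](obs)
        predicted = student(obs)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count


# Hypothetical toy teachers standing in for trained per-skill policies.
def walk_teacher(obs: Observation) -> Action:
    return [0.5 * o for o in obs]


def stand_teacher(obs: Observation) -> Action:
    return [0.0 for _ in obs]


teachers = {"walk": walk_teacher, "stand": stand_teacher}
batch = [("walk", [1.0, 2.0]), ("stand", [1.0, 2.0])]

# A student that only imitates walking is penalized on the standing sample.
loss = distillation_loss(walk_teacher, teachers, batch)
print(loss)
```

In this toy setup the loss is zero only when the student reproduces whichever teacher is active for each sample, which is the property a multi-behavior student would need.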
If this is right
- The distilled controller enables the robot to switch between standing up, walking, running, and jumping based on environmental cues.
- Reinforced fine-tuning with online feedback improves performance on irregular terrains beyond what independently trained policies achieve.
- The resulting controller demonstrates strong adaptability in both simulation and physical experiments on the Unitree G1 robot.
- The approach reduces the need to maintain multiple separate behavior-specific controllers.
Where Pith is reading between the lines
- The same distillation-plus-fine-tuning pattern could be tested on other legged platforms to see whether multi-skill adaptation transfers beyond humanoids.
- Adding richer sensory streams such as vision during the fine-tuning stage might further close the gap between simulated and real terrain performance.
- The online feedback loop suggests a route toward continual adaptation after initial deployment rather than one-time training.
Load-bearing premise
Multi-behavior distillation followed by reinforced fine-tuning on online feedback will automatically produce reliable behavior switching and terrain generalization.
What would settle it
The central claim would be shown false if, in real-world Unitree G1 tests, the controller failed to switch behaviors correctly or performed brittlely on an irregular terrain never encountered during fine-tuning.
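One way to make this test quantitative is to report success rates on held-out terrain with an error bar across random seeds. The sketch below is illustrative only: the function name `success_rate_with_stderr` is hypothetical, the trial outcomes are made up, and the standard error of the mean is one of several reasonable choices of error bar.

```python
import math
import statistics
from typing import Sequence, Tuple


def success_rate_with_stderr(
    per_seed_outcomes: Sequence[Sequence[bool]],
) -> Tuple[float, float]:
    """Return (mean success rate across seeds, standard error of that mean)."""
    rates = [sum(outcomes) / len(outcomes) for outcomes in per_seed_outcomes]
    mean = statistics.mean(rates)
    # The sample standard deviation needs at least two seeds.
    if len(rates) > 1:
        stderr = statistics.stdev(rates) / math.sqrt(len(rates))
    else:
        stderr = float("nan")
    return mean, stderr


# Made-up trial outcomes for illustration only (True = traversed without falling).
held_out_trials = [
    [True, True, False, True],   # seed 0
    [True, False, False, True],  # seed 1
    [True, True, True, True],    # seed 2
]
mean_rate, stderr = success_rate_with_stderr(held_out_trials)
print(f"success rate: {mean_rate:.2f} +/- {stderr:.2f}")
```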
Original abstract
Humanoid robots show promise for learning a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC), which adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback in performing adaptive behaviors on more diverse terrains, enhancing terrain adaptability for the controller. We conduct experiments both in simulation and in the real world on Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains. Project website: https://ahc-humanoid.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Adaptive Humanoid Control (AHC), a two-stage framework for humanoid locomotion. Primary policies are trained for individual skills (standing, walking, running, jumping) and distilled into a single multi-behavior controller that switches behaviors according to environmental cues. This controller is then refined via reinforced fine-tuning that collects online feedback on diverse terrains to improve robustness. Experiments are reported in simulation and on the Unitree G1 hardware, with the central claim that the resulting controller exhibits strong adaptability across situations and terrains.
Significance. If the empirical claims are substantiated, the work would offer a practical route to unified controllers that avoid the brittleness of per-skill policies, potentially simplifying deployment of humanoids on irregular terrain. The two-stage distillation-plus-RL structure is a clear engineering contribution that could be adopted by other locomotion systems.
major comments (3)
- [§4 and abstract] §4 (Experiments) and abstract: the central claim of 'strong adaptability across various situations and terrains' is presented without quantitative metrics (success rates, traversal distances, energy consumption), baseline comparisons (independent per-skill policies, single-policy RL, or prior distillation methods), error bars, or an explicit definition of how adaptability was measured. This absence makes the claim unverifiable from the reported evidence.
- [§3.2] §3.2 (Reinforced Fine-Tuning): the reward function, terrain sampling distribution, and online feedback mechanism are not specified. Because the claim that fine-tuning improves terrain generalization beyond the distilled policy rests on these choices, their omission is load-bearing; without them it is impossible to rule out that observed gains arise from favorable shaping or narrow terrain coverage rather than the proposed mechanism.
- [§3.1] §3.1 (Multi-Behavior Distillation): no analysis or ablation is provided to show that behavior switching occurs on the basis of environment cues rather than memorization of training conditions. A concrete test (e.g., out-of-distribution terrain or cue perturbation) would be required to support the adaptability assertion.
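The cue-perturbation test requested in the §3.1 comment can be made concrete. The following is a minimal sketch under stated assumptions: the behavior selector is treated as a black-box function from an observation to a behavior label, the perturbation is Gaussian noise on a single input, and `switch_rate_under_perturbation` and `toy_selector` are hypothetical names, not part of the paper.

```python
import random
from typing import Callable, Sequence

BehaviorSelector = Callable[[Sequence[float]], str]


def switch_rate_under_perturbation(
    select: BehaviorSelector,
    observations: Sequence[Sequence[float]],
    cue_index: int,
    noise_scale: float,
    rng: random.Random,
) -> float:
    """Fraction of observations whose selected behavior changes when one input is perturbed."""
    changed = 0
    for obs in observations:
        perturbed = list(obs)
        perturbed[cue_index] += rng.gauss(0.0, noise_scale)
        if select(perturbed) != select(obs):
            changed += 1
    return changed / len(observations)


# Hypothetical selector: jump when the obstacle-height cue (index 0) exceeds 0.2 m.
def toy_selector(obs: Sequence[float]) -> str:
    return "jump" if obs[0] > 0.2 else "walk"


rng = random.Random(0)
observations = [[0.05 * i, 1.0] for i in range(10)]

# Perturbing the input the selector reads can flip its decisions;
# perturbing an input it ignores (index 1) never does.
cue_rate = switch_rate_under_perturbation(toy_selector, observations, 0, 0.3, rng)
ignored_rate = switch_rate_under_perturbation(toy_selector, observations, 1, 0.3, rng)
print(cue_rate, ignored_rate)
```

Comparing the switch rate for a genuine terrain cue against the rate for an unrelated input gives a rough indication of whether switching depends on that cue.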
minor comments (2)
- The project website is referenced but the manuscript would benefit from explicit pointers to supplementary videos or code that demonstrate the claimed real-world behavior switching.
- [§3] Notation for the distilled policy and the fine-tuned policy should be introduced once and used consistently; currently the distinction is described only in prose.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We believe the suggested revisions will significantly strengthen the paper by providing more rigorous evidence for our claims. We address each major comment below.
- Referee: [§4 and abstract] §4 (Experiments) and abstract: the central claim of 'strong adaptability across various situations and terrains' is presented without quantitative metrics (success rates, traversal distances, energy consumption), baseline comparisons (independent per-skill policies, single-policy RL, or prior distillation methods), error bars, or an explicit definition of how adaptability was measured. This absence makes the claim unverifiable from the reported evidence.
  Authors: We agree that additional quantitative evidence is necessary to substantiate the central claim. In the revised manuscript, we will expand §4 to include specific metrics such as success rates for skill transitions and terrain traversal, average distances traversed before failure, and energy consumption (e.g., torque norms). We will report these with error bars from multiple random seeds. Baseline comparisons will be added against independent per-skill policies and a monolithic RL policy trained directly on all behaviors. We will also explicitly define adaptability as the controller's ability to seamlessly switch behaviors and maintain stability on terrains with varying irregularities not seen during initial training. These changes will allow readers to verify the claims directly from the data.
  Revision: yes
- Referee: [§3.2] §3.2 (Reinforced Fine-Tuning): the reward function, terrain sampling distribution, and online feedback mechanism are not specified. Because the claim that fine-tuning improves terrain generalization beyond the distilled policy rests on these choices, their omission is load-bearing; without them it is impossible to rule out that observed gains arise from favorable shaping or narrow terrain coverage rather than the proposed mechanism.
  Authors: The referee correctly identifies that these implementation details are essential. We will revise §3.2 to fully specify the reward function, which includes terms for forward velocity tracking, posture stability, foot clearance, and action smoothness. The terrain sampling distribution will be detailed, including the range of slope angles, step heights, and roughness levels used during fine-tuning. The online feedback mechanism involves periodic evaluation on a held-out set of diverse terrains, with policy updates based on accumulated rewards from these interactions. This will demonstrate that the improvements stem from the reinforced fine-tuning process rather than specific shaping.
  Revision: yes
- Referee: [§3.1] §3.1 (Multi-Behavior Distillation): no analysis or ablation is provided to show that behavior switching occurs on the basis of environment cues rather than memorization of training conditions. A concrete test (e.g., out-of-distribution terrain or cue perturbation) would be required to support the adaptability assertion.
  Authors: While the distillation process is intended to produce a policy that conditions behavior on current observations including terrain features, we acknowledge the lack of explicit validation for cue-based switching. In the revised version, we will add an analysis in §3.1 or a new subsection, including ablation experiments where we test on out-of-distribution terrains (e.g., unseen obstacle configurations) and with artificially perturbed environmental cues. Performance degradation under cue perturbation would indicate reliance on cues, while robustness would support the adaptability claim. We will also visualize the behavior selection probabilities conditioned on different inputs.
  Revision: yes
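The four reward terms named in the authors' response on §3.2 could be combined as a weighted sum. The sketch below is a guess at the general shape only: the function name `locomotion_reward`, the weights, the 0.25 tracking scale, and the 0.1 m clearance cap are all assumptions made for illustration, not values from the paper.

```python
import math
from typing import Sequence


def locomotion_reward(
    forward_velocity: float,
    target_velocity: float,
    torso_tilt_rad: float,
    foot_clearance_m: float,
    action: Sequence[float],
    previous_action: Sequence[float],
) -> float:
    """Weighted sum of four shaping terms; all weights and scales are illustrative."""
    # Velocity tracking: 1.0 at the target, decaying with squared error.
    tracking = math.exp(-((forward_velocity - target_velocity) ** 2) / 0.25)
    # Posture stability: penalize torso tilt away from upright.
    posture = -(torso_tilt_rad ** 2)
    # Foot clearance: reward swing-foot height up to a 0.1 m cap.
    clearance = min(foot_clearance_m, 0.1)
    # Action smoothness: penalize change between consecutive actions.
    smoothness = -sum((a - b) ** 2 for a, b in zip(action, previous_action))
    return 1.0 * tracking + 0.5 * posture + 2.0 * clearance + 0.01 * smoothness


reward = locomotion_reward(
    forward_velocity=1.0,
    target_velocity=1.0,
    torso_tilt_rad=0.0,
    foot_clearance_m=0.05,
    action=[0.1, 0.2],
    previous_action=[0.1, 0.2],
)
print(reward)
```

Writing the terms out this way makes the referee's concern tangible: changing any weight changes which behaviors the fine-tuning stage favors, so reported gains are hard to interpret without the actual values.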
Circularity Check
No circularity: empirical two-stage method with no self-defining equations or fitted predictions
full rationale
The paper describes a two-stage empirical framework: training primary locomotion policies, followed by multi-behavior distillation to create an adaptive controller, then reinforced fine-tuning using online feedback on diverse terrains. No equations, parameter fittings, or derivations are presented in the provided text that reduce a claimed result to its own inputs by construction. Adaptability is asserted via simulation and real-world experiments on Unitree G1 robots rather than any self-referential definition or self-citation load-bearing premise. The central claims therefore remain independent of the patterns that would indicate circularity.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: two-stage framework... multi-behavior distillation... reinforced fine-tuning... PCGrad... behavior-specific critics
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, theorem embed_strictMono_of_one_lt
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: AMP-based reward... discriminator... style reward r_style
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.