Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

Hao Xu; Junwei Liang; Qiang Zhang; Shuo Yang; Teli Ma; Yizhao Li; Yudong Fan; Zifan Wang

arxiv: 2606.08059 · v1 · pith:AMZGSIGXnew · submitted 2026-06-06 · 💻 cs.RO

Zifan Wang, Yizhao Li, Teli Ma, Qiang Zhang, Yudong Fan, Hao Xu, Shuo Yang, Junwei Liang

Pith reviewed 2026-06-27 19:47 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid control · motion priors · terrain adaptation · foundation model · perceptive behavior · reference synthesis · teacher-student transfer · residual learning

The pith

Human motion priors are adapted to a robot's local terrain by synthesizing conformal references from raw clips and transferring them to a student policy via residual corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to remove the assumption that human motion references are already compatible with the robot's surroundings. It does so by creating terrain-conformal references from locomotion clips and training a student policy that receives only the original raw references. Terrain features reach the policy only through residual pathways that start at zero and learn corrections only when needed. This keeps the motion-tracking prior intact while allowing adaptation of contacts, posture, and timing to the robot's actual ground.
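As a rough illustration of the zero-initialized residual idea described above, the following Python sketch adds a terrain correction on top of a base tracking action. The class name `ResidualTerrainPolicy`, the toy linear layers, and all weights are assumptions made for illustration; the paper's student is an identity-gated Transformer, which this sketch does not reproduce.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ResidualTerrainPolicy:
    """Toy policy (hypothetical): action = base(reference) + residual(terrain)."""

    def __init__(self, base_w: Matrix, base_b: Vector, n_terrain: int) -> None:
        self.base_w = base_w
        self.base_b = base_b
        n_actions = len(base_w)
        # Assumption: the residual head is zero-initialized, so the terrain
        # branch contributes nothing until training moves these weights.
        self.res_w: Matrix = [[0.0] * n_terrain for _ in range(n_actions)]
        self.res_b: Vector = [0.0] * n_actions

    def base_action(self, reference: Vector) -> Vector:
        """Action of the motion-tracking prior alone."""
        return linear(self.base_w, self.base_b, reference)

    def act(self, reference: Vector, terrain: Vector) -> Vector:
        """Base action plus the learned terrain correction."""
        base = self.base_action(reference)
        correction = linear(self.res_w, self.res_b, terrain)
        return [b + c for b, c in zip(base, correction)]


policy = ResidualTerrainPolicy(base_w=[[1.0, 0.0], [0.5, -0.5]], base_b=[0.0, 0.1], n_terrain=3)
reference = [0.2, -0.4]
terrain_scan = [0.05, 0.10, 0.00]
# At initialization the terrain scan cannot change the action.
assert policy.act(reference, terrain_scan) == policy.base_action(reference)
```

Because the residual head starts at zero, the first output equals the base tracker's output, and any later deviation has to be learned from terrain supervision.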

Core claim

Perceptive BFM grounds human motion priors in robot-centric perception while preserving raw kinematic motion references as the behavioral interface. TCRS converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. A blind adapted-reference teacher is trained and its terrain-conformal behavior is transferred to a deployed raw-reference student through target-frame action alignment in an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve

What carries the argument

terrain-conformal reference synthesis (TCRS), the pipeline that converts human motion clips into terrain-consistent references via contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics; paired with residual pathways in an identity-gated Transformer tracker that add local terrain corrections only when required.
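A minimal sketch of what the first TCRS stage might look like in one dimension, assuming a height-field function and a per-frame contact flag. `FootFrame`, `snap_contacts_to_terrain`, and `step_terrain` are hypothetical names, and the later stages (swing optimization, root reconstruction, collision repair, inverse kinematics) are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FootFrame:
    x: float       # forward position of the foot (m)
    z: float       # foot height in the raw flat-ground clip (m)
    contact: bool  # True while the foot is in stance


def snap_contacts_to_terrain(
    frames: List[FootFrame], height_at: Callable[[float], float]
) -> List[FootFrame]:
    """Toy foothold construction: place every stance frame on the terrain surface."""
    adapted = []
    for frame in frames:
        if frame.contact:
            adapted.append(FootFrame(frame.x, height_at(frame.x), True))
        else:
            # Swing frames are left alone; a later stage would reshape them.
            adapted.append(frame)
    return adapted


def step_terrain(x: float) -> float:
    """Hypothetical terrain: a 0.15 m step starting at x = 0.5 m."""
    return 0.15 if x >= 0.5 else 0.0


raw_clip = [
    FootFrame(0.0, 0.0, True),
    FootFrame(0.3, 0.08, False),
    FootFrame(0.6, 0.0, True),
]
adapted_clip = snap_contacts_to_terrain(raw_clip, step_terrain)
```

The swing frame at x = 0.3 m is left untouched here; in the pipeline described above, reshaping it would be the job of the swing-optimization stage.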

If this is right

Raw kinematic references remain usable as the behavioral interface even when human and robot environments differ.
Local terrain observations adapt contacts, posture, and timing without retraining the core motion prior.
Terrain features enter the policy only through residuals, so corrections occur only when the raw reference is incompatible.
Scalable terrain supervision is obtained from automated synthesis rather than hand-designed or terrain-specific motion data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The residual-pathway design could allow perception modules to be added to existing motion trackers without full retraining.
If TCRS generalizes beyond locomotion clips, the same separation might support non-walking behaviors such as manipulation or climbing.
The teacher-student split separates the problem of reference synthesis from the problem of learning terrain corrections, which could be tested independently.

Load-bearing premise

That TCRS can reliably turn human locomotion clips into terrain-consistent references without artifacts that break policy training or cause real-world instability.

What would settle it

Run the trained student policy on terrain where the TCRS pipeline produces incorrect footholds or swing trajectories and check whether tracking fails or the robot falls.

Figures

Figures reproduced from arXiv: 2606.08059 by Hao Xu, Junwei Liang, Qiang Zhang, Shuo Yang, Teli Ma, Yizhao Li, Yudong Fan, Zifan Wang.

**Figure 1.** Single-policy terrain grounding. A single Perceptive BFM tracks diverse flat-ground human-motion commands while adapting them to randomly placed robot-side terrains. Robot-centric perception adjusts footholds, swing clearance, posture, and contact timing online.

**Figure 2.** Perceptive BFM overview. TCRS synthesizes terrain-conformal references offline only; it is never queried at deployment. A blind teacher learns adapted-reference tracking on this supervision; the deployed identity-gated Transformer student receives the raw reference and a robot-centric terrain scan, and learns local residual corrections through target-frame action alignment. The deployment command remains t…

**Figure 3.** TCRS trajectory synthesis. The blue ghost is the raw reference placed on terrain; the opaque robot is the TCRS output. Foot traces compare the sampling-based (model predictive path integral, MPPI) foot-end optimization used in TCRS (yellow), Cubic Interp (blue), and direct terrain-height z-lifting (black). [Plot residue: mean reward vs. iterations (10^3); legend entries PMT, PMT w/o distillation, Flat MLP, MLP-…]

**Figure 4.** [caption not extracted]

**Figure 5.** Real-robot mocap mismatch. (a) Human mocap motion captured on flat ground; (b,c) the robot tracks the (a) command over robot-side terrain; (d) a separate walk-and-dance motion deployed in the wild.

**Figure 6.** Representative failure. The upper-body command is collision-unaware, so arms or torso can strike obstacles.

Assumptions. TCRS is a kinematic synthesizer: it builds contact-consistent, style-preserving references without solving contact-rich dynamics, and assumes a static, rigid, observable height field, so it does not model deformable, granular, or slippery media, and assumes the upper-body command stays …

**Figure 7.** Detailed PMT network architecture. (A) Inputs: policy observation $o_t$, 10-step proprioceptive history $h_{t-k:t}$, command window $c_{t-m:t}$, terrain map $(H_t, M_t)$, critic observation, and supervision targets. (B) PMT actor: a Transformer motion-tracking backbone with cross-attention encoders $E_h$, $E_c$ produces a motion intent $u_t$; the terrain perception branch (Map CNN $f_{\mathrm{cnn}}$ followed by a query-conditioned MapTransformer…
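Figure 3 names a sampling-based MPPI optimizer for the swing-foot trajectory. The sketch below shows a generic MPPI-style update on a 1-D swing height profile with fixed footholds. The cost terms, the hyperparameters, and the accept-only-if-better safeguard are assumptions for illustration and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence


def swing_cost(z: Sequence[float], terrain: Sequence[float], margin: float = 0.03) -> float:
    """Hypothetical cost: penalize low clearance and jerky height profiles."""
    inner = range(1, len(z) - 1)
    clearance = sum(max(0.0, terrain[i] + margin - z[i]) ** 2 for i in inner)
    smooth = sum((z[i - 1] - 2.0 * z[i] + z[i + 1]) ** 2 for i in inner)
    return 100.0 * clearance + smooth


def mppi_swing(
    z0: Sequence[float],
    terrain: Sequence[float],
    iters: int = 50,
    samples: int = 64,
    sigma: float = 0.02,
    lam: float = 0.01,
    seed: int = 0,
) -> List[float]:
    """MPPI-style refinement of swing heights; endpoints (footholds) stay fixed."""
    rng = random.Random(seed)
    z = list(z0)
    for _ in range(iters):
        noises, costs = [], []
        for _ in range(samples):
            eps = [0.0] * len(z)
            for i in range(1, len(z) - 1):
                eps[i] = rng.gauss(0.0, sigma)
            noises.append(eps)
            costs.append(swing_cost([a + e for a, e in zip(z, eps)], terrain))
        best = min(costs)
        # Softmin weights: low-cost samples dominate the update.
        weights = [math.exp(-(c - best) / lam) for c in costs]
        total = sum(weights)
        candidate = [
            a + sum(w * n[i] for w, n in zip(weights, noises)) / total
            for i, a in enumerate(z)
        ]
        # Safeguard (an assumption, not part of plain MPPI): keep improvements only.
        if swing_cost(candidate, terrain) < swing_cost(z, terrain):
            z = candidate
    return z


terrain_profile = [0.0, 0.0, 0.15, 0.15, 0.15, 0.15]
raw_swing = [0.0, 0.05, 0.08, 0.08, 0.05, 0.15]
optimized_swing = mppi_swing(raw_swing, terrain_profile)
```

The acceptance check makes the cost non-increasing across iterations, which keeps the toy example well behaved at the price of departing from the textbook update.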

Original abstract

Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.
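The abstract does not define target-frame action alignment, so the following is only a sketch of the generic action-matching loss that a teacher-student transfer of this kind could start from. It does show the input asymmetry the abstract describes: the teacher sees the adapted reference and no terrain, while the student sees the raw reference plus a terrain scan. All function and field names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Sample = Dict[str, Vector]


def distillation_loss(
    student: Callable[[Vector, Vector], Vector],
    teacher: Callable[[Vector], Vector],
    batch: Sequence[Sample],
) -> float:
    """Mean squared error between student and teacher actions over a batch."""
    total, count = 0.0, 0
    for sample in batch:
        # Teacher is blind: it gets the terrain-conformal reference only.
        target = teacher(sample["adapted_ref"])
        # Student gets the raw reference and a local terrain observation.
        prediction = student(sample["raw_ref"], sample["terrain"])
        for p, t in zip(prediction, target):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("batch produced no action dimensions")
    return total / count


def toy_teacher(adapted_ref: Vector) -> Vector:
    """Stand-in teacher that simply replays the adapted reference."""
    return list(adapted_ref)


def toy_student(raw_ref: Vector, terrain: Vector) -> Vector:
    """Stand-in student that ignores terrain, i.e. an untrained residual."""
    return list(raw_ref)


toy_batch = [{"raw_ref": [0.0, 0.0], "adapted_ref": [1.0, 3.0], "terrain": [0.5]}]
toy_loss = distillation_loss(toy_student, toy_teacher, toy_batch)
```

With a student that ignores terrain, the loss is exactly the gap between the raw and adapted references, which is the signal the residual pathways would be trained to close.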

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The TCRS pipeline plus the teacher-student transfer with residual terrain paths is the actual new piece, but it stands or falls on whether the synthesis keeps motions usable.

The paper's core advance is terrain-conformal reference synthesis (TCRS), which converts raw human locomotion clips into terrain-consistent references in five explicit stages. Those references are used to train a blind adapted-reference teacher, whose behavior is then distilled into a raw-reference student through target-frame action alignment. The student is an identity-gated Transformer that receives terrain features only through residual pathways initialized to produce zero correction.

This setup directly tackles the mismatch between human motion data and robot-local terrain, which is a practical barrier for whole-body controllers. Keeping the raw kinematic reference as the behavioral interface while adding perception only where needed is a clean design choice.

The TCRS stages themselves are described in enough detail to be reproducible in principle. The residual initialization is a sensible way to protect the motion-tracking prior.

The soft spot is exactly the one the stress-test flags: TCRS must not introduce artifacts in contact timing, clearance, or posture that the residual pathways cannot fix. The abstract supplies no contact error numbers, kinematic deviation stats, or stage ablations, so it is impossible to tell whether the synthesized references remain faithful enough for stable teacher training. If the full paper contains those checks and they are positive, the framework holds; if not, the central claim is under-supported.

This is for people working on motion-based humanoid policies and perceptive locomotion. A reader already building similar teacher-student setups would get concrete ideas from the pipeline.

It deserves peer review because the problem is real and the combination of TCRS with residual-gated transfer is new, even though the evidence on synthesis quality needs to be stronger.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. It preserves raw kinematic references as the behavioral interface and uses local terrain observations to adapt contacts, posture, and timing. The key technical contribution is terrain-conformal reference synthesis (TCRS), a five-stage pipeline (contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, multi-point inverse kinematics) that converts locomotion-oriented human motion clips into terrain-consistent references. A blind adapted-reference teacher is trained and its behavior transferred to a deployed raw-reference student (an identity-gated Transformer tracker) via target-frame action alignment, with terrain features entering through residual pathways initialized to preserve the motion-tracking prior.

Significance. If the TCRS pipeline reliably produces artifact-free references and the teacher-student transfer succeeds, the work would enable scalable reuse of human motion data for expressive humanoid behaviors on varied real-world terrain without requiring terrain-compatible demonstrations, addressing a key limitation of existing motion-centric foundation policies.

major comments (2)

[TCRS pipeline description] The TCRS description (abstract and method) claims the five-stage process produces terrain-consistent references faithful enough for stable teacher training and student transfer, but supplies no quantitative validation such as contact timing error, foot clearance statistics, kinematic deviation metrics, or success rates on downstream policy training. This is load-bearing for the central claim, as any systematic distortion in timing, posture, or clearance would undermine the raw-reference student's ability to recover intended behavior via residual corrections.
[Experiments / Evaluation] No ablation studies or quantitative results are reported to isolate the contribution of the residual terrain pathways, the target-frame action alignment transfer, or the identity-gated Transformer architecture. Without these, it is not possible to verify whether the student produces local corrections only when needed or whether the framework outperforms baselines that assume terrain-compatible references.

minor comments (2)

[Method] Clarify the exact definition of 'target-frame action alignment' and how it differs from standard imitation or distillation losses used in prior motion-tracking work.
[Student architecture] The abstract states the student is 'trained to produce local corrections only when needed,' but the initialization and training details for the residual pathways should be expanded for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quantitative validation of TCRS and ablations on the transfer components. We address each major comment below and will revise the manuscript accordingly to strengthen the central claims.

Referee: [TCRS pipeline description] The TCRS description (abstract and method) claims the five-stage process produces terrain-consistent references faithful enough for stable teacher training and student transfer, but supplies no quantitative validation such as contact timing error, foot clearance statistics, kinematic deviation metrics, or success rates on downstream policy training. This is load-bearing for the central claim, as any systematic distortion in timing, posture, or clearance would undermine the raw-reference student's ability to recover intended behavior via residual corrections.

Authors: We agree that explicit quantitative validation of TCRS is essential to support the claim that the synthesized references are sufficiently faithful. The current manuscript emphasizes the pipeline design and qualitative demonstrations but does not report the requested metrics. In the revised version we will add a new evaluation subsection that computes and reports contact timing error (mean absolute deviation in stance/swing phases), foot clearance statistics (minimum and average clearance over swing trajectories), kinematic deviation metrics (joint angle and root position RMSE relative to original human references), and downstream success rates (percentage of stable teacher training episodes and student transfer success across terrain types). These will be evaluated on a held-out set of locomotion clips adapted to procedurally generated terrains.
revision: yes

Referee: [Experiments / Evaluation] No ablation studies or quantitative results are reported to isolate the contribution of the residual terrain pathways, the target-frame action alignment transfer, or the identity-gated Transformer architecture. Without these, it is not possible to verify whether the student produces local corrections only when needed or whether the framework outperforms baselines that assume terrain-compatible references.

Authors: We concur that isolating the contributions of the residual terrain pathways, target-frame action alignment, and identity-gated Transformer is necessary to substantiate the design choices. The present manuscript presents the integrated framework and overall results but omits these controlled ablations. In revision we will expand the experiments section with quantitative ablations: (1) variants with/without residual pathways (measuring terrain adaptation error and tracking fidelity), (2) alternative transfer methods versus target-frame action alignment (reporting policy success rate and correction magnitude), and (3) comparisons against non-gated Transformer baselines. All ablations will include performance on both simulated and real-robot terrain tasks to demonstrate when and how local corrections are applied.
revision: yes
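The metrics promised in the first response can be sketched in a few lines. This is one possible reading, assuming per-frame boolean contact flags, a fixed frame period `dt`, and touchdown events matched in order; the function names and these definitions are assumptions, since the page does not give the authors' exact formulas.

```python
import math
from typing import List, Sequence, Tuple


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square error, e.g. for joint angles or root position."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def touchdown_frames(contact: Sequence[bool]) -> List[int]:
    """Frame indices where the foot switches from swing to stance."""
    return [i for i in range(1, len(contact)) if contact[i] and not contact[i - 1]]


def contact_timing_error(
    ref_contact: Sequence[bool], out_contact: Sequence[bool], dt: float
) -> float:
    """Mean absolute touchdown-time difference in seconds (assumed definition)."""
    ref_td = touchdown_frames(ref_contact)
    out_td = touchdown_frames(out_contact)
    if len(ref_td) != len(out_td) or not ref_td:
        raise ValueError("touchdown events must be non-empty and match in number")
    return sum(abs(r - o) for r, o in zip(ref_td, out_td)) * dt / len(ref_td)


def swing_clearance(
    foot_z: Sequence[float], terrain_z: Sequence[float], contact: Sequence[bool]
) -> Tuple[float, float]:
    """Minimum and mean foot-to-terrain gap over swing frames."""
    gaps = [z - t for z, t, c in zip(foot_z, terrain_z, contact) if not c]
    if not gaps:
        raise ValueError("no swing frames found")
    return min(gaps), sum(gaps) / len(gaps)
```

Matching touchdowns in order is the simplest choice; it fails loudly when the synthesized clip gains or loses a step, which is itself a useful artifact check.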

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained method description

The provided abstract and framework description introduce Perceptive BFM and TCRS as a sequence of processing stages (contact-aware foothold construction, swing optimization, root reconstruction, collision repair, multi-point IK) followed by teacher-student transfer via target-frame action alignment. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations are present that would reduce any claimed output to its inputs by construction. The central claim rests on the described pipeline producing usable references, which is an empirical precondition rather than a circular reduction. This matches the default case of a non-circular proposal of a new control architecture.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is based solely on the abstract; the ledger reflects the high-level components described. TCRS is presented as a composite procedure whose internal optimization steps are likely to contain tunable weights or thresholds not enumerated here.

axioms (1)

domain assumption: Human motion priors remain a valid behavioral interface even after terrain-induced modifications to contacts and timing.
Invoked when the paper states that raw kinematic references are preserved as the behavioral interface while local terrain observations adapt the execution.

invented entities (1)

terrain-conformal reference · no independent evidence
purpose: Provides a terrain-consistent motion target derived from raw human clips for training the teacher policy.
New construct introduced via the TCRS pipeline; no independent evidence outside the paper is supplied in the abstract.

Reference graph

Works this paper leans on

49 extracted references · 35 canonical work pages · 7 internal anchors

[1]

T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. M. Kitani, C. Liu, and G. Shi. OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. InProceedings of the 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 1516–1540. PMLR, 2025. URL https://proceedings. mlr.press/v270/he25b.html

2025
[2]

Cheng, Y

X. Cheng, Y . Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots. InProceedings of Robotics: Science and Systems, Delft, Netherlands, July
[3]

doi:10.15607/RSS.2024.XX.107

work page doi:10.15607/rss.2024.xx.107 2024
[4]

T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, L. Fan, and Y . Zhu. HOVER: Versatile neural whole-body controller for humanoid robots.arXiv preprint arXiv:2410.21229, 2024. doi:10.48550/arXiv.2410.21229

work page doi:10.48550/arxiv.2410.21229 2024
[5]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, A. Zhang, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14734 2025
[6]

W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025. doi:10.48550/arXiv.2509.13780

work page doi:10.48550/arxiv.2509.13780 2025
[7]

Y . Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, A. Lazaric, M. Pirotta, and G. Shi. BFM-Zero: A promptable behavioral founda- tion model for humanoid control using unsupervised reinforcement learning.arXiv preprint arXiv:2511.04131, 2025. doi:10.48550/arXiv.2511.04131

work page doi:10.48550/arxiv.2511.04131 2025
[8]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, X. Da, L. Fan, and Y . Zhu. SONIC: Supersizing motion tracking for natural humanoid whole- body control.arXiv preprint arXiv:2511.07820, 2025. doi:10.48550/arXiv.2511.07820

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.07820 2025
[9]

S. Zhu, Z. Zhuang, M. Zhao, K.-Y . Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids.arXiv preprint arXiv:2601.07718, 2026. doi:10.48550/ arXiv.2601.07718

arXiv 2026
[10]

Zhuang, S

Z. Zhuang, S. Zhu, M. Zhao, and H. Zhao. Deep whole-body parkour.arXiv preprint arXiv:2601.07701, 2026. doi:10.48550/arXiv.2601.07701

work page doi:10.48550/arxiv.2601.07701 2026
[11]

Z. Wu, X. Huang, L. Yang, Y . Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, G. Shi, and C. K. Liu. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching.arXiv preprint arXiv:2602.15827, 2026. doi:10.48550/arXiv.2602. 15827

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602 2026
[12]

X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills.ACM Transactions on Graphics, 37 (4):143, 2018. doi:10.1145/3197517.3201311

work page doi:10.1145/3197517.3201311 2018
[13]

X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. AMP: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4):1–20, 2021. doi:10.1145/3450626.3459670

work page doi:10.1145/3450626.3459670 2021
[14]

T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi. Learning human-to-humanoid real-time whole-body teleoperation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951, 2024. doi:10.1109/IROS58592.2024.10801984. 9

work page doi:10.1109/iros58592.2024.10801984 2024
[15]

Z. Luo, J. Cao, A. Winkler, K. Kitani, and W. Xu. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10895–10904, 2023. doi:10.1109/ICCV51070.2023.01000

work page doi:10.1109/iccv51070.2023.01000 2023
[16]

Z. Luo, J. Cao, J. Merel, A. Winkler, J. Huang, K. Kitani, and W. Xu. Universal humanoid motion representations for physics-based control. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=OrOd8PxOO2

2024
[17]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. HumanPlus: Humanoid shadowing and imitation from humans. InProceedings of the 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 2828–2844. PMLR, 2025. URL https://proceedings.mlr.press/v270/fu25a.html

2025
[18]

Y . Ma, H. Yu, J. Xie, C. Lv, Q. Luo, C. Zhang, Y . Yin, B. Xing, X. Ren, and D. Zheng. Robust and generalized humanoid motion tracking.arXiv preprint arXiv:2601.23080, 2026. doi:10.48550/arXiv.2601.23080

work page doi:10.48550/arxiv.2601.23080 2026
[19]

Y . Wang, S. Zhu, P. Zhi, Y . Li, J. Li, Y .-L. Li, Y . Xiao, X. Wang, B. Jia, and S. Huang. OmniXtreme: Breaking the generality barrier in high-dynamic humanoid control.arXiv preprint arXiv:2602.23843, 2026. doi:10.48550/arXiv.2602.23843

work page doi:10.48550/arxiv.2602.23843 2026
[20]

Agarwal, A

A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. InProceedings of the 6th Conference on Robot Learning, volume 205 ofProceedings of Machine Learning Research, pages 403–415. PMLR, 2023. URL https://proceedings.mlr.press/v205/agarwal23a.html

2023
[21]

Zhuang, Z

Z. Zhuang, Z. Fu, J. Wang, C. G. Atkeson, S. Schwertfeger, C. Finn, and H. Zhao. Robot parkour learning. InProceedings of the 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 73–92. PMLR, 2023. URL https: //proceedings.mlr.press/v229/zhuang23a.html

2023
[22]

nvblox: GPU - accelerated incremental signed distance field mapping,

X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450, 2024. doi:10.1109/ICRA57147.2024.10610200

work page doi:10.1109/icra57147.2024.10610200 2024
[23]

Zhuang, S

Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning.arXiv preprint arXiv:2406.10759,

arXiv
[24]

doi:10.48550/arXiv.2406.10759

work page doi:10.48550/arxiv.2406.10759
[25]

H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. BeamDojo: Learning agile humanoid locomotion on sparse footholds.arXiv preprint arXiv:2502.10363, 2025. doi:10.48550/arXiv.2502.10363

work page doi:10.48550/arxiv.2502.10363 2025
[26]

W. Sun, Y . Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, F. Yan, E. Xie, and Z. Xie. Now you see that: Learning end-to-end humanoid locomotion from raw pixels.arXiv preprint arXiv:2602.06382, 2026. doi:10.48550/arXiv.2602.06382

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.06382 2026
[27]

Z. Wang, T. Ma, Y . Jia, X. Yang, J. Zhou, W. Ouyang, Q. Zhang, and J. Liang. Omni-perception: Omnidirectional collision avoidance for legged locomotion in dynamic environments.arXiv preprint arXiv:2505.19214, 2025. doi:10.48550/arXiv.2505.19214

work page doi:10.48550/arxiv.2505.19214 2025
[28]

Z. Wang, X. Yang, J. Zhao, J. Zhou, T. Ma, Z. Gao, A. Ajoudani, and J. Liang. End-to-end humanoid robot safe and comfortable locomotion policy.arXiv preprint arXiv:2508.07611,

arXiv
[29]

doi:10.48550/arXiv.2508.07611

work page doi:10.48550/arxiv.2508.07611
[30]

Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking

Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking.arXiv preprint arXiv:2604.17335, 2026. doi:10.48550/arXiv.2604.17335. 10

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.17335 2026
[31]

W. D. Compton, Z. Olkin, and A. D. Ames. Terrain consistent reference-guided RL for humanoid navigation autonomy.arXiv preprint arXiv:2605.15517, 2026. doi:10.48550/arXiv.2605.15517

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.15517 2026
[32]

Y . Li, P. Zhi, Y . Wang, T. Liu, S. Yan, W. Liu, X. Wang, B. Jia, and S. Huang. OmniTrack: General motion tracking via physics-consistent reference.arXiv preprint arXiv:2602.23832,

arXiv
[33]

doi:10.48550/arXiv.2602.23832

work page doi:10.48550/arxiv.2602.23832
[34]

S. Choi, M. K. X. J. Pan, and J. Kim. Nonparametric motion retargeting for humanoid robots on shared latent space. InRobotics: Science and Systems, 2020. doi:10.15607/RSS.2020.XVI.071

work page doi:10.15607/rss.2020.xvi.071 2020
[35]

Villegas, J

R. Villegas, J. Yang, D. Ceylan, and H. Lee. Neural kinematic networks for unsupervised motion retargetting. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8639–8648, 2018. doi:10.1109/CVPR.2018.00901

work page doi:10.1109/cvpr.2018.00901 2018
[36]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. OmniRetarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025. doi:10.48550/ arXiv.2509.26633

Pith/arXiv arXiv 2025
[37]

Dantec, M

E. Dantec, M. Naveau, P. Fernbach, N. A. Villa, G. Saurel, O. Stasse, M. Taix, and N. Mansard. Whole-body model predictive control for biped locomotion on a torque-controlled humanoid robot.IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 638–644,
[38]

doi:10.1109/Humanoids53995.2022.10000129

work page doi:10.1109/humanoids53995.2022.10000129 2022
[39]

Pajon, S

A. Pajon, S. Caron, G. De Magistris, S. Miossec, and A. Kheddar. Walking on gravel with soft soles using linear inverted pendulum tracking and reaction force distribution. In2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 432–437,
[40]

doi:10.1109/HUMANOIDS.2017.8246909

work page doi:10.1109/humanoids.2017.8246909 2017
[41]

J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning quadrupedal locomotion over challenging terrain.Science Robotics, 5(47):eabc5986, 2020. doi:10.1126/scirobotics. abc5986

work page doi:10.1126/scirobotics 2020
[42]

Kumar, Z

A. Kumar, Z. Fu, D. Pathak, and J. Malik. RMA: Rapid motor adaptation for legged robots. In Robotics: Science and Systems, 2021. doi:10.15607/RSS.2021.XVII.011

work page doi:10.15607/rss.2021.xvii.011 2021
[43]

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild.Science Robotics, 7(62):eabk2822,
[44]

doi:10.1126/scirobotics.abk2822

work page doi:10.1126/scirobotics.abk2822
[45]

T. He, J. Gao, W. Xiao, Y . Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, Z. Yi, G. Qu, K. Kitani, J. Hodgins, L. J. Fan, Y . Zhu, C. Liu, and G. Shi. ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. doi:10.48550/arXiv.2502.01143

work page doi:10.48550/arxiv.2502.01143 2025
[46]

Residual Policy Learning

T. Silver, K. Allen, J. Tenenbaum, and L. P. Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018. doi:10.48550/arXiv.1812.06298

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1812.06298 2018
[47]

Johannink, S

T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control.IEEE International Conference on Robotics and Automation (ICRA), pages 6023–6029, 2019. doi:10.1109/ICRA.2019.8794127

work page doi:10.1109/icra.2019.8794127 2019
[48]

S. Zhao, Y . Ze, Y . Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. ResMimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025. doi:10.48550/arXiv.2510.05070

work page doi:10.48550/arxiv.2510.05070 2025
[49]

Z. Wang, Y . Jia, L. Shi, H. Wang, H. Zhao, X. Li, J. Zhou, J. Ma, and G. Zhou. Arm- constrained curriculum learning for loco-manipulation of a wheel-legged robot. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10770– 10776. IEEE, 2024. doi:10.1109/IROS58592.2024.10802062. 11 A Additional Implementation Details A....

work page doi:10.1109/iros58592.2024.10802062 2024

[1]

T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. M. Kitani, C. Liu, and G. Shi. OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. In Proceedings of the 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 1516–1540. PMLR, 2025. URL https://proceedings.mlr.press/v270/he25b.html

[2]

X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.107

[4]

T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, L. Fan, and Y. Zhu. HOVER: Versatile neural whole-body controller for humanoid robots. arXiv preprint arXiv:2410.21229, 2024. doi:10.48550/arXiv.2410.21229

[5]

J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A. Zhang, ... GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. doi:10.48550/arXiv.2503.14734

[6]

W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots. arXiv preprint arXiv:2509.13780, 2025. doi:10.48550/arXiv.2509.13780

[7]

Y. Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, A. Lazaric, M. Pirotta, and G. Shi. BFM-Zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. arXiv preprint arXiv:2511.04131, 2025. doi:10.48550/arXiv.2511.04131

[8]

Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, X. Da, L. Fan, and Y. Zhu. SONIC: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025. doi:10.48550/arXiv.2511.07820

[9]

S. Zhu, Z. Zhuang, M. Zhao, K.-Y. Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids. arXiv preprint arXiv:2601.07718, 2026. doi:10.48550/arXiv.2601.07718

[10]

Z. Zhuang, S. Zhu, M. Zhao, and H. Zhao. Deep whole-body parkour. arXiv preprint arXiv:2601.07701, 2026. doi:10.48550/arXiv.2601.07701

[11]

Z. Wu, X. Huang, L. Yang, Y. Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, G. Shi, and C. K. Liu. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026. doi:10.48550/arXiv.2602.15827

[12]

X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4):143, 2018. doi:10.1145/3197517.3201311

[13]

X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. AMP: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4):1–20, 2021. doi:10.1145/3450626.3459670

[14]

T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi. Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951, 2024. doi:10.1109/IROS58592.2024.10801984

[15]

Z. Luo, J. Cao, A. Winkler, K. Kitani, and W. Xu. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10895–10904, 2023. doi:10.1109/ICCV51070.2023.01000

[16]

Z. Luo, J. Cao, J. Merel, A. Winkler, J. Huang, K. Kitani, and W. Xu. Universal humanoid motion representations for physics-based control. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=OrOd8PxOO2

[17]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. HumanPlus: Humanoid shadowing and imitation from humans. In Proceedings of the 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2828–2844. PMLR, 2025. URL https://proceedings.mlr.press/v270/fu25a.html

[18]

Y. Ma, H. Yu, J. Xie, C. Lv, Q. Luo, C. Zhang, Y. Yin, B. Xing, X. Ren, and D. Zheng. Robust and generalized humanoid motion tracking. arXiv preprint arXiv:2601.23080, 2026. doi:10.48550/arXiv.2601.23080

[19]

Y. Wang, S. Zhu, P. Zhi, Y. Li, J. Li, Y.-L. Li, Y. Xiao, X. Wang, B. Jia, and S. Huang. OmniXtreme: Breaking the generality barrier in high-dynamic humanoid control. arXiv preprint arXiv:2602.23843, 2026. doi:10.48550/arXiv.2602.23843

[20]

A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. In Proceedings of the 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 403–415. PMLR, 2023. URL https://proceedings.mlr.press/v205/agarwal23a.html

[21]

Z. Zhuang, Z. Fu, J. Wang, C. G. Atkeson, S. Schwertfeger, C. Finn, and H. Zhao. Robot parkour learning. In Proceedings of the 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 73–92. PMLR, 2023. URL https://proceedings.mlr.press/v229/zhuang23a.html

[22]

X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450, 2024. doi:10.1109/ICRA57147.2024.10610200

[23]

Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. arXiv preprint arXiv:2406.10759, 2024. doi:10.48550/arXiv.2406.10759

[25]

H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. BeamDojo: Learning agile humanoid locomotion on sparse footholds. arXiv preprint arXiv:2502.10363, 2025. doi:10.48550/arXiv.2502.10363

[26]

W. Sun, Y. Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, F. Yan, E. Xie, and Z. Xie. Now you see that: Learning end-to-end humanoid locomotion from raw pixels. arXiv preprint arXiv:2602.06382, 2026. doi:10.48550/arXiv.2602.06382

[27]

Z. Wang, T. Ma, Y. Jia, X. Yang, J. Zhou, W. Ouyang, Q. Zhang, and J. Liang. Omni-perception: Omnidirectional collision avoidance for legged locomotion in dynamic environments. arXiv preprint arXiv:2505.19214, 2025. doi:10.48550/arXiv.2505.19214

[28]

Z. Wang, X. Yang, J. Zhao, J. Zhou, T. Ma, Z. Gao, A. Ajoudani, and J. Liang. End-to-end humanoid robot safe and comfortable locomotion policy. arXiv preprint arXiv:2508.07611, 2025. doi:10.48550/arXiv.2508.07611

[30]

Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking. arXiv preprint arXiv:2604.17335, 2026. doi:10.48550/arXiv.2604.17335

[31]

W. D. Compton, Z. Olkin, and A. D. Ames. Terrain consistent reference-guided RL for humanoid navigation autonomy. arXiv preprint arXiv:2605.15517, 2026. doi:10.48550/arXiv.2605.15517

[32]

Y. Li, P. Zhi, Y. Wang, T. Liu, S. Yan, W. Liu, X. Wang, B. Jia, and S. Huang. OmniTrack: General motion tracking via physics-consistent reference. arXiv preprint arXiv:2602.23832, 2026. doi:10.48550/arXiv.2602.23832

[34]

S. Choi, M. K. X. J. Pan, and J. Kim. Nonparametric motion retargeting for humanoid robots on shared latent space. In Robotics: Science and Systems, 2020. doi:10.15607/RSS.2020.XVI.071

[35]

R. Villegas, J. Yang, D. Ceylan, and H. Lee. Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8639–8648, 2018. doi:10.1109/CVPR.2018.00901

[36]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. OmniRetarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint arXiv:2509.26633, 2025. doi:10.48550/arXiv.2509.26633

[37]

E. Dantec, M. Naveau, P. Fernbach, N. A. Villa, G. Saurel, O. Stasse, M. Taix, and N. Mansard. Whole-body model predictive control for biped locomotion on a torque-controlled humanoid robot. IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 638–644, 2022. doi:10.1109/Humanoids53995.2022.10000129

[39]

A. Pajon, S. Caron, G. De Magistris, S. Miossec, and A. Kheddar. Walking on gravel with soft soles using linear inverted pendulum tracking and reaction force distribution. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 432–437, 2017. doi:10.1109/HUMANOIDS.2017.8246909

[41]

J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5(47):eabc5986, 2020. doi:10.1126/scirobotics.abc5986

[42]

A. Kumar, Z. Fu, D. Pathak, and J. Malik. RMA: Rapid motor adaptation for legged robots. In Robotics: Science and Systems, 2021. doi:10.15607/RSS.2021.XVII.011

[43]

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022. doi:10.1126/scirobotics.abk2822
