ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control

Furui Xu; Huayi Wang; Jiahe Chen; Jianan Li; Jiangmiao Pang; Jingbo Wang; Kailin Li; Lihe Ding; Tai Wang; Tianfan Xue

arXiv: 2606.30362 · v1 · pith:BYKSVUHVnew · submitted 2026-06-29 · 💻 cs.RO · cs.AI · cs.CV

Xiao Chen, Weishuai Zeng, Xiaojie Niu, Zirui Wang, Jianan Li, Huayi Wang, Furui Xu, Jiahe Chen, Weixiang Zhong, Lihe Ding, Kailin Li, Jiangmiao Pang, Tai Wang, Tianfan Xue, Jingbo Wang

Pith reviewed 2026-06-30 05:06 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV

keywords: humanoid control, closed-loop motion planning, behavior foundation models, exposure bias, reactive whole-body control, generative motion planning, prefix sampling curriculum

The pith

ReactiveBFM trains generative planners on imperfect states via prefix sampling to enable reactive closed-loop humanoid control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that cascading open-loop Behavior Foundation Models with motion planners fails due to cumulative tracking errors that create exposure bias. ReactiveBFM counters this by training the planner on noisy physical states through a scheduled prefix sampling curriculum, forcing it to learn recovery behaviors. An asynchronous replanning scheme plus trajectory chunking then reconciles planning latency with high-frequency tracking. The resulting system runs on the Unitree G1 and produces text-conditioned whole-body motions that remain stable under perturbation.

Core claim

ReactiveBFM is a real-time closed-loop planning-control framework whose core is a scheduled prefix sampling curriculum that forces the generative planner to learn error-recovery behaviors from imperfect physical states rather than ground-truth trajectories; this is combined with an asynchronous replanning mechanism and trajectory chunking to produce spatio-temporally fluid execution.

What carries the argument

Scheduled prefix sampling curriculum that trains the planner on imperfect physical states to induce error-recovery behaviors.
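
The abstract does not give the schedule itself, so the following is only a minimal sketch of what a scheduled prefix sampling step could look like. The linear ramp, the `p_max` cap, the whole-prefix swap, and the names `rollout_probability` and `sample_prefix` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Sequence


def rollout_probability(step: int, total_steps: int, p_max: float = 0.8) -> float:
    """Chance of conditioning on an imperfect prefix; ramps linearly from 0 to p_max.

    The linear shape and the p_max cap are assumptions, not values from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return p_max * progress


def sample_prefix(gt_prefix: Sequence, rollout_prefix: Sequence,
                  step: int, total_steps: int,
                  rng: random.Random, p_max: float = 0.8) -> List:
    """Return the prefix the planner is conditioned on for one training sample.

    gt_prefix:      ground-truth past states from the dataset.
    rollout_prefix: the same window as actually reached by the tracker
                    (imperfect physical states).
    """
    p = rollout_probability(step, total_steps, p_max)
    use_rollout = rng.random() < p
    return list(rollout_prefix if use_rollout else gt_prefix)
```

Early in training the planner sees mostly clean prefixes; later it is increasingly asked to continue from states the tracker really produced, which is the situation it faces at deployment.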

If this is right

Achieves 93.1 percent success rate in sim-to-sim benchmarking under severe perturbations.
Outperforms cascaded open-loop baselines by 28.6 percent.
Enables zero-shot moving target reaching with intricate whole-body coordination and on-the-fly replanning.
Guarantees spatio-temporally fluid execution without physical jitter across a repertoire of text-conditioned motions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curriculum approach may reduce dependence on accurate state estimation in unstructured settings.
Similar prefix sampling could be tested on other legged platforms to check transfer of recovery behaviors.
Pairing the planner with onboard perception could allow direct closed-loop response to visual changes without separate tracking layers.

Load-bearing premise

The scheduled prefix sampling curriculum successfully induces error-recovery behaviors in the generative planner when driven by imperfect physical states rather than ground-truth trajectories.

What would settle it

Removing the prefix sampling curriculum and measuring whether success rate under the same severe perturbations falls below 70 percent.
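
A single success rate near 70 percent would not settle much without a trial count. As a hedged sketch, a Wilson score interval can be used to check whether an ablated run sits clearly under the threshold. The function names and the 95 percent level (`z = 1.96`) are choices made here, and any trial counts passed in are hypothetical since the abstract reports none.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def clearly_below(successes: int, trials: int, threshold: float = 0.70) -> bool:
    """True only if the whole interval lies under the threshold."""
    _, upper = wilson_interval(successes, trials)
    return upper < threshold
```

With this check, an ablated run counts as falling below 70 percent only when the upper end of its interval does, which guards against reading noise as an effect.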

Figures

Figures reproduced from arXiv: 2606.30362 by Furui Xu, Huayi Wang, Jiahe Chen, Jianan Li, Jiangmiao Pang, Jingbo Wang, Kailin Li, Lihe Ding, Tai Wang, Tianfan Xue, Weishuai Zeng, Weixiang Zhong, Xiao Chen, Xiaojie Niu, Zirui Wang.

**Figure 1.** We introduce ReactiveBFM, a closed-loop framework integrating a behavior foundation model with a reactive whole-body motion planner. Guided by proprioceptive feedback, text, and target positions, ReactiveBFM enables robust text-conditioned control and seamless zero-shot replanning to reach moving targets.

**Figure 2.** Overview of ReactiveBFM. (a) Asynchronous closed-loop inference coupled with a universal controller via trajectory chunking and proprioceptive feedback. (b) Architecture of our reactive motion planner. It generates smooth robot trajectories from multi-modal streaming conditions. (c) Core training strategies including a scheduled prefix sampling curriculum and condition adherence to mitigate exposure bias…

**Figure 3.** (a) ReactiveBFM enables long-horizon zero-shot deployment of reaching a moving target. (b) The recorded actual robot-target trajectories show the effectiveness. (c) The curve illustrates that our system reactively plans coordinated whole-body motion and always reaches the moving global localizer.

**Figure 4.** Real-world deployment of ReactiveBFM under text-conditioned and streaming interactive…

**Figure 5.** Real-world robustness evaluation under diverse physical perturbations: (a) repeated heavy kicks; (b) holding for over 1 s to forcibly reverse the rotation direction; (c) repeated strikes with a 3 kg ball; and (d) being dragged off-balance and down.

**Figure 6.** Timeline and latency analysis during real-world deployment.

**Figure 7.** We utilize HTC VIVE Ultimate Trackers for global localization. During deployment, one tracker is mounted on the back of the robot’s pelvis, and another is attached to a handheld toy sword.

Original abstract

While current Behavior Foundation Models (BFMs) provide robust control priors for humanoids, they only execute pre-defined reference motions. As a result, they are vulnerable to environmental shifts and incapable of reactive whole-body coordination. Naively cascading them with generative motion planners fails to achieve true reactivity, as inevitable tracking discrepancies induce fatal cumulative exposure bias. To bridge this gap, we propose ReactiveBFM, a real-time closed-loop planning-control framework. At its core, we effectively mitigate exposure bias via a scheduled prefix sampling curriculum, forcing the generative planner to actively learn error-recovery behaviors from imperfect physical states rather than ground-truth trajectories. Systematically, to reconcile the severe latency mismatch between auto-regressive planning and high-frequency tracking, we introduce an asynchronous replanning mechanism. Combined with trajectory chunking to temporally ensemble spatial references, our system guarantees spatio-temporally fluid execution without physical jitter. Deployed on the Unitree G1 humanoid, ReactiveBFM demonstrates unprecedented physical agility across a vast repertoire of text-conditioned closed-loop motions. Notably, ReactiveBFM achieves zero-shot moving target reaching, showcasing intricate whole-body coordination and on-the-fly replanning. In sim-to-sim benchmarking under severe perturbations, ReactiveBFM achieves a 93.1% success rate, significantly outperforming cascaded open-loop baselines by 28.6%.
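
The abstract says trajectory chunking is used to temporally ensemble spatial references but does not give the weighting. One plausible reading, sketched below under stated assumptions, is that every planned chunk covering the current control step contributes to the reference, with fresher chunks weighted more. The exponential weighting, the `decay` value, and the name `ensemble_reference` are illustrative choices, not details from the paper.

```python
import math
from typing import List, Sequence, Tuple

# (first_step, frames): frames[i] is the reference pose planned for step first_step + i.
Chunk = Tuple[int, Sequence[Sequence[float]]]


def ensemble_reference(chunks: Sequence[Chunk], t: int, decay: float = 0.5) -> List[float]:
    """Blend all chunks that cover control step t into one reference pose.

    Weight per chunk: exp(-decay * age), where age is how many steps older the
    chunk is than the newest covering chunk (an assumed scheme).
    """
    covering = [(first, frames) for first, frames in chunks
                if first <= t < first + len(frames)]
    if not covering:
        raise ValueError(f"no chunk covers step {t}")
    newest = max(first for first, _ in covering)
    dim = len(covering[0][1][0])
    blended = [0.0] * dim
    total = 0.0
    for first, frames in covering:
        w = math.exp(-decay * (newest - first))
        pose = frames[t - first]
        for j in range(dim):
            blended[j] += w * pose[j]
        total += w
    return [v / total for v in blended]
```

Because consecutive chunks overlap, the reference handed to the tracker changes gradually when a new plan arrives instead of jumping, which is one way to read the "without physical jitter" claim. Setting `decay=0.0` reduces this to a plain average.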

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ReactiveBFM adds a prefix sampling curriculum and async replanning to close the loop on BFMs, but the 93.1% success claim lacks ablations isolating the curriculum's contribution.

The main takeaway is that ReactiveBFM tries to convert open-loop Behavior Foundation Models into reactive closed-loop controllers for humanoids. It does this with a scheduled prefix sampling curriculum that trains the planner on imperfect physical states instead of ground-truth trajectories, plus asynchronous replanning and trajectory chunking to handle latency and keep motions smooth. On the Unitree G1 they show text-conditioned motions and zero-shot target reaching, and in sim-to-sim tests under perturbations they report 93.1% success, 28.6 points above cascaded baselines.

The new elements are the specific curriculum for error recovery and the async setup to reconcile planning and tracking speeds. These target a real deployment issue with BFMs, where tracking drift quickly breaks the system. The framework description gives a concrete way to think about closing that loop.

The soft spot is the missing isolation for the curriculum. The abstract credits the performance jump to the prefix schedule forcing recovery behaviors, yet there is no ablation that turns the schedule off or randomizes it while keeping replanning and chunking fixed. Without that, the gain could come from the replanning alone. The numbers also come without details on the exact schedule, baseline implementations, perturbation magnitudes, or any statistical checks.

This is for robotics researchers working on humanoid whole-body control and sim-to-real transfer. A reader interested in practical fixes for generative motion models would get usable ideas from the mechanisms even if the experiments need more work.

It should go to peer review. The problem is worth addressing and the proposed pieces are reasonable, but the central performance story needs tighter evidence before the claims can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReactiveBFM, a real-time closed-loop framework for humanoid whole-body control that combines generative motion planning with a scheduled prefix sampling curriculum to mitigate exposure bias, asynchronous replanning to handle latency mismatch, and trajectory chunking for fluid execution. It claims this enables reactive text-conditioned motions on the Unitree G1, including zero-shot moving target reaching, and reports a 93.1% success rate in sim-to-sim benchmarking under severe perturbations, outperforming cascaded open-loop baselines by 28.6%.

Significance. If the performance gains and mechanism are substantiated, the work would advance reactive control for humanoids by addressing a key limitation of behavior foundation models (exposure bias under imperfect state feedback), potentially enabling more robust closed-loop behaviors without hand-crafted recovery policies.

major comments (2)

[Abstract] Abstract (paragraph on mitigation of exposure bias): The central claim that the scheduled prefix sampling curriculum induces error-recovery behaviors from imperfect physical states (rather than ground-truth trajectories) is load-bearing for the 28.6% performance gain, yet no ablation is described that removes or randomizes the prefix schedule while holding replanning, chunking, and the base BFM fixed. Without this isolation, the success-rate delta cannot be attributed specifically to the curriculum.
[Abstract] Abstract (sim-to-sim benchmarking paragraph): The reported 93.1% success rate and 28.6% improvement lack supporting details on curriculum schedule parameters, baseline definitions, perturbation magnitudes, number of trials, or statistical tests, leaving the quantitative claim unsupported by visible evidence in the manuscript.
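
To make the request for statistical tests concrete, here is a small sketch of a two-proportion z-test that a success-rate comparison could report. It is a standard textbook test rather than anything taken from the manuscript, and the function name is made up here.

```python
import math
from typing import Tuple


def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("trial counts must be positive")
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p1 - p2) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

As a purely hypothetical illustration, `two_proportion_z(931, 1000, 645, 1000)` corresponds to 93.1% versus 64.5% over 1000 trials each. Both the trial count and the reading of 28.6 as an absolute difference are assumptions, which is exactly the information the comment above asks the authors to supply.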

minor comments (2)

The manuscript should include explicit pseudocode or a table for the prefix sampling schedule and the asynchronous replanning logic to allow reproduction.
Clarify whether the sim-to-sim results use the same perturbation distribution for training and testing, and report variance across seeds.
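
For the asynchronous replanning logic, the kind of pseudocode requested might look like the tick-based sketch below. It is a single-threaded simulation written for this review, not the authors' code: a real deployment would presumably run the planner in its own thread or process. The latency compensation (each plan is generated for the step at which it will become available) and the hold-last-frame fallback are assumptions, as are the names `run_async_loop` and `planner`.

```python
from typing import Callable, List, Optional, Tuple

# (ready_step, first_step, frames): a plan that becomes usable at ready_step and
# whose frames[i] is the reference for control step first_step + i.
Plan = Tuple[int, int, List[float]]


def run_async_loop(control_steps: int, plan_latency: int,
                   planner: Callable[[int, int], List[float]]
                   ) -> Tuple[List[float], List[int]]:
    """Simulate a fast control loop fed by a slow planner.

    planner(observed_step, first_step) returns a chunk of references starting
    at first_step, computed from the state observed at observed_step.
    Returns the reference executed at each step and the steps where a new plan
    was swapped in.
    """
    active_first, active = 0, planner(0, 0)  # initial plan assumed ready at t = 0
    pending: Optional[Plan] = None
    executed: List[float] = []
    swaps: List[int] = []
    for t in range(control_steps):
        # Swap in the plan that has finished computing.
        if pending is not None and pending[0] <= t:
            _, active_first, active = pending
            pending = None
            swaps.append(t)
        # The planner is idle: request a plan for the step at which it will arrive.
        if pending is None:
            first = t + plan_latency
            pending = (first, first, planner(t, first))
        # Track the active chunk; hold its last frame if the next plan is late.
        idx = min(t - active_first, len(active) - 1)
        executed.append(active[idx])
    return executed, swaps
```

The controller never waits on the planner. Coverage is gap-free only when each chunk is at least as long as the planning latency, which is the constraint a table of schedule and latency parameters would let readers verify.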

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript to strengthen the presentation of our claims.

Referee: [Abstract] Abstract (paragraph on mitigation of exposure bias): The central claim that the scheduled prefix sampling curriculum induces error-recovery behaviors from imperfect physical states (rather than ground-truth trajectories) is load-bearing for the 28.6% performance gain, yet no ablation is described that removes or randomizes the prefix schedule while holding replanning, chunking, and the base BFM fixed. Without this isolation, the success-rate delta cannot be attributed specifically to the curriculum.

Authors: We agree that an explicit ablation isolating the scheduled prefix sampling curriculum is necessary to rigorously attribute the performance gains. The manuscript describes the curriculum's role in learning error-recovery from imperfect states, but does not include a controlled comparison against a fixed or randomized prefix schedule with replanning and chunking held constant. In the revised manuscript, we will add this ablation in the experiments section to directly quantify its contribution to the 28.6% improvement.
revision: yes

Referee: [Abstract] Abstract (sim-to-sim benchmarking paragraph): The reported 93.1% success rate and 28.6% improvement lack supporting details on curriculum schedule parameters, baseline definitions, perturbation magnitudes, number of trials, or statistical tests, leaving the quantitative claim unsupported by visible evidence in the manuscript.

Authors: We acknowledge that the abstract and main text require expanded details to support the quantitative results. In the revision, we will update both the abstract and the sim-to-sim benchmarking section to specify curriculum schedule parameters, exact baseline configurations, perturbation magnitudes, number of trials, and statistical tests (including means, standard deviations, and significance levels).
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

The paper's core claims rest on an empirical sim-to-sim success rate (93.1%) measured under perturbations, presented as an outcome of the proposed asynchronous replanning, chunking, and prefix-sampling curriculum rather than a quantity defined by construction from those mechanisms. No equations appear that equate a fitted parameter to a renamed prediction, no self-citation chain supplies a uniqueness theorem that forces the architecture, and the exposure-bias mitigation is described as a training procedure whose effect is validated externally by benchmark deltas. The derivation chain therefore remains self-contained against external benchmarks; absence of an ablation isolating the curriculum is a question of evidence strength, not circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review yields minimal ledger entries; the curriculum schedule parameters and simulation fidelity are the primary unexamined elements.

free parameters (1)

prefix sampling schedule parameters
The curriculum relies on an unspecified schedule whose values determine how quickly imperfect states are introduced.

axioms (1)

Domain assumption: the simulation environment faithfully reproduces the dynamics needed for the sim-to-sim results to carry over to the physical Unitree G1.
Benchmarking claims rest on this unstated transfer assumption.

pith-pipeline@v0.9.1-grok · 5824 in / 1060 out tokens · 49275 ms · 2026-06-30T05:06:51.077361+00:00 · methodology

[1] [1]

K. Yin, W. Zeng, K. Fan, M. Dai, Z. Wang, Q. Zhang, Z. Tian, J. Wang, J. Pang, and W. Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

work page arXiv 2025

[2] [2]

Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

work page arXiv 2025

[3] [3]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Casta ˜neda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

work page arXiv 2025

[5] [5]

Y . Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning.arXiv preprint arXiv:2511.04131, 2025

work page arXiv 2025

[6] [6]

Y . Wei, Z. Wang, K. Yin, Y . Hu, J. Wang, and S. Chen. Unveiling the impact of data and model scaling on high-level control for humanoid robots.arXiv preprint arXiv:2511.09241, 2025

work page arXiv 2025

[7] [7]

J. Li, X. Chen, T. Huang, and T.-T. Wong. Learning to control physically-simulated 3d char- acters via generating and mimicking 2d motions.arXiv preprint arXiv:2512.08500, 2025

work page arXiv 2025

[8] [8]

Kalaria, S

D. Kalaria, S. S. Harithas, P. Katara, S. Kwak, S. Bhagat, S. Sastry, S. Sridhar, S. Vemprala, A. Kapoor, and J. C.-K. Huang. Dreamcontrol: Human-inspired whole-body humanoid control for scene interaction via guided diffusion.arXiv preprint arXiv:2509.14353, 2025

work page arXiv 2025

[9] [9]

J. Li, J. Cao, H. Zhang, D. Rempe, J. Kautz, U. Iqbal, and Y . Yuan. Genmo: A generalist model for human motion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11766–11776, 2025

2025

[10] [10]

R. Chen, M. Shi, S. Huang, P. Tan, T. Komura, and X. Chen. Taming diffusion probabilistic models for character control. InACM SIGGRAPH 2024 Conference Papers, pages 1–10, 2024

2024

[11] [11]

K. Zhao, G. Li, and S. Tang. Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control. InInternational Conference on Learning Representa- tions, volume 2025, pages 23569–23592, 2025

2025

[12] [12]

Tevet, S

G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne. Closd: Closing the loop between simulation and diffusion for multi-task character control. InThe Thirteenth International Conference on Learning Representations

[13] [13]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

W. Xie, J. Zheng, J. Han, J. Shi, W. Zhang, C. Bai, and X. Li. Textop: Real-time interactive text-driven humanoid robot motion generation and control.arXiv preprint arXiv:2602.07439, 2026

work page arXiv 2026

[16] [16]

A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffu- sion models. InInternational Conference on Machine Learning, pages 16784–16804. PMLR, 2022. 10

2022

[17] [17]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022

[19] [19]

Tevet, S

G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-or, and A. H. Bermano. Human motion diffusion model. InThe Eleventh International Conference on Learning Representations, 2023

2023

[20] [20]

Zhang, Y

J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, X. Shen, and Y . Shan. Generating human motion from textual descriptions with discrete representations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14730–14740, 2023

2023

[21] [21]

Jiang, X

B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

2023

[22] [22]

Mahmood, N

N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019

2019

[23] [23]

C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022

2022

[24] J. Lin, A. Zeng, S. Lu, Y. Cai, R. Zhang, H. Wang, and L. Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36:25268–25280, 2023.

[25] S. Lu, J. Wang, Z. Lu, L.-H. Chen, W. Dai, J. Dong, Z. Dou, B. Dai, and R. Zhang. Scamo: Exploring the scaling law in autoregressive motion generation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27872–27882, 2025.

[26] K. Fan, S. Lu, M. Dai, R. Yu, L. Xiao, Z. Dou, J. Dong, L. Ma, and J. Wang. Go to zero: Towards zero-shot motion generation with million-scale data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025.

[27] Z. Luo, J. Cao, K. Kitani, W. Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023.

[28] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots. arXiv preprint arXiv:2402.16796, 2024.

[29] T. Huang, H. Wang, J. Ren, K. Yin, Z. Wang, X. Chen, F. Jia, W. Zhang, J. Long, J. Wang, et al. Towards adaptable humanoid control via adaptive motion tracking. arXiv preprint arXiv:2510.14454, 2025.

[30] I. Mason, S. Starke, and T. Komura. Real-time style modelling of human locomotion via feature-wise transformations and local motion phases. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–18, 2022.

[31] J. Chen, Z. Wang, F. Jia, X. Chen, X. Niu, W. Zeng, T. Xue, X. Zhou, J. Pang, and J. Wang. Imagine2real: Towards zero-shot humanoid-object interaction via video generative priors. arXiv preprint arXiv:2605.22272, 2026.

[32] L. Xiao, S. Lu, H. Pi, K. Fan, L. Pan, Y. Zhou, Z. Feng, X. Zhou, S. Peng, and J. Wang. Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space. arXiv preprint arXiv:2503.15451, 2025.

[33] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016.

[34] H. Wang, W. Zhang, R. Yu, T. Huang, J. Ren, F. Jia, Z. Wang, X. Niu, X. Chen, J. Chen, Q. Chen, J. Wang, and J. Pang. Physhsi: Towards a real-world generalizable and natural humanoid-scene interaction system. arXiv preprint arXiv:2510.11072, 2025.

[35] A. Anonymous. Scaling behavior foundation model for humanoid robots. Under Review, 2026.

[36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[37] D. Rempe, M. Petrovich, Y. Yuan, H. Zhang, X. B. Peng, Y. Jiang, T. Wang, U. Iqbal, D. Minor, M. de Ruyter, et al. Kimodo: Scaling controllable human motion generation. arXiv preprint arXiv:2603.15546, 2026.

[38] J. P. Araujo, Y. Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025.

[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[40] N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

[41] B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

[42] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[43] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y. Feng, A. Garg, R. Gasoto, L. Gulich, Y. Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V. Makoviychuk, G. Malczyk, H... Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning. arXiv, 2025