
arxiv: 2606.29209 · v1 · pith:ZFULWBDGnew · submitted 2026-06-28 · 💻 cs.RO · cs.AI

AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance

Pith reviewed 2026-06-30 07:42 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords humanoid control · keypoint guidance · latent motion representation · whole-body tracking · transformer encoder · motion distillation · teleoperation · locomotion

The pith

AnyBody learns one latent motion space that any keypoint subset can drive for whole-body humanoid control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AnyBody to let a humanoid robot follow commands from any chosen subset of body keypoints supplied at runtime. It first trains a privileged teacher tracker on large unstructured motion data, then distills the tracker online into a deterministic encoder-decoder whose latent space is a unit sphere. A transformer keypoint encoder uses masked self-attention to map arbitrary keypoint subsets into this same latent space. The frozen decoder acts as a motor prior, and a lightweight residual corrector handles task-specific adjustments. This setup removes the need for full-body motion capture or separate upper- and lower-body hierarchies while preserving coordinated whole-body behavior.

Core claim

AnyBody closes the gap between full-motion-capture trackers and partial-keypoint control by learning a single latent motion representation on a unit sphere; a masked self-attention transformer aligns any keypoint subset to this representation, and the resulting latent commands a shared decoder that produces coordinated whole-body actions without hierarchical decomposition.
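One way to picture how a masked self-attention encoder can accept any keypoint subset is the minimal sketch below. It is not the paper's architecture: it uses a single head with identity query/key/value projections, mean pooling over present keypoints, and hypothetical function names (`masked_self_attention`, `pool_present`). The only point it illustrates is that absent keypoints receive zero attention weight, so their contents cannot influence the pooled vector.

```python
import math


def masked_self_attention(tokens, mask):
    """Single-head self-attention that never attends to absent keypoints.

    tokens: list of N feature vectors (lists of floats), one per keypoint slot.
    mask:   list of N booleans; True means the keypoint was supplied at runtime.
    Returns N output vectors; rows for absent keypoints are zeroed.
    Assumption: identity Q/K/V projections, for illustration only.
    """
    if not any(mask):
        raise ValueError("at least one keypoint must be present")
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i, query in enumerate(tokens):
        if not mask[i]:
            outputs.append([0.0] * dim)
            continue
        # Absent keys get a score of -inf, which softmax turns into weight 0.
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) if mask[j] else -math.inf
            for j, key in enumerate(tokens)
        ]
        peak = max(scores)  # finite, because the query itself is present
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d]
                for w, value, present in zip(weights, tokens, mask) if present)
            for d in range(dim)
        ])
    return outputs


def pool_present(outputs, mask):
    """Average the outputs of present keypoints into one fixed-size vector."""
    present = [out for out, m in zip(outputs, mask) if m]
    dim = len(outputs[0])
    return [sum(vec[d] for vec in present) / len(present) for d in range(dim)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    mask = [True, True, False]  # third keypoint not supplied
    print(pool_present(masked_self_attention(tokens, mask), mask))
```

Because the pooled vector has the same size for every mask, the same downstream projection into the latent space could be reused for any subset; whether the paper pools this way is not stated in the abstract.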

What carries the argument

Unit-sphere latent space from online distillation of a privileged teacher tracker, addressed by a masked self-attention transformer keypoint encoder.
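As a rough illustration of what a unit-sphere latent and an alignment objective could look like, the sketch below normalizes latents to unit length and scores agreement with one minus cosine similarity. The abstract does not state which alignment loss the authors use, so the cosine form and the names `to_unit_sphere` and `alignment_loss` are assumptions made for this example.

```python
import math


def to_unit_sphere(vec, eps=1e-8):
    """Rescale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def alignment_loss(keypoint_latent, privileged_latent):
    """One minus cosine similarity between the two latents on the sphere.

    Assumed loss for illustration: 0 when the directions match,
    1 when orthogonal, 2 when opposite.
    """
    a = to_unit_sphere(keypoint_latent)
    b = to_unit_sphere(privileged_latent)
    return 1.0 - sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    print(to_unit_sphere([3.0, 4.0]))
    print(alignment_loss([1.0, 0.0], [0.0, 2.0]))
```

On a sphere only direction carries information, so a loss of this kind ignores latent magnitude entirely; that is the design property the sketch is meant to show.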

If this is right

  • Arbitrary keypoint subsets at deploy time produce coordinated whole-body humanoid motions.
  • Downstream tasks are learned by adding a lightweight residual corrector on top of the frozen decoder.
  • Large-scale human motion tracking works from partial keypoint inputs without retargeting.
  • Free-form control and teleoperation become possible with flexible keypoint choices.
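The residual-corrector idea in the list above can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the decoder is a fixed linear map standing in for the frozen motor prior, the residual is added in latent space with a hypothetical `scale` factor, and the result is re-projected onto the unit sphere (the abstract does not say whether re-projection is used).

```python
import math


def to_unit_sphere(vec, eps=1e-8):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def corrected_latent(base_latent, residual, scale=0.1):
    """Nudge the base latent by a scaled residual, then return to the sphere.

    Both the re-projection and the `scale` value are assumptions of this sketch.
    """
    shifted = [z + scale * r for z, r in zip(base_latent, residual)]
    return to_unit_sphere(shifted)


def frozen_decoder(latent):
    """Stand-in for the frozen motor prior: a fixed linear map to two outputs."""
    weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # never updated
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def act(base_latent, residual, scale=0.1):
    """Only `residual` would be produced by the task-specific corrector."""
    return frozen_decoder(corrected_latent(base_latent, residual, scale))


if __name__ == "__main__":
    base = [1.0, 0.0, 0.0]
    print(act(base, [0.0, 0.0, 0.0]))  # zero residual leaves the prior's output unchanged
    print(act(base, [0.0, 1.0, 0.0]))  # a small residual shifts the decoded output
```

Keeping the decoder fixed and bounding the residual is what would let a downstream task adapt behavior without overwriting the motion prior.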

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time camera-based keypoint detectors could replace mocap suits for robot training data.
  • The same latent-space approach might transfer to non-humanoid robots if the motor prior generalizes.
  • Task-specific correctors could be swapped at runtime to switch behaviors without retraining the core controller.

Load-bearing premise

The privileged teacher tracker trained on unstructured motion data can be distilled into a deterministic encoder-decoder whose unit-sphere latent space stays addressable by arbitrary masked keypoint subsets without loss of coordinated whole-body motion quality.

What would settle it

Run the same locomotion sequence twice, once with the full keypoint set and once with only upper-body keypoints, and check whether leg coordination degrades in the upper-body-only run.
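A hedged sketch of how such a comparison might be scored: compute the mean leg-joint position error against the reference motion under each condition and report the difference. The joint names, the frame format (a dict from joint name to an (x, y, z) tuple), and the choice of mean Euclidean error as the coordination proxy are all assumptions of this example, not the paper's evaluation protocol.

```python
import math

# Hypothetical joint names; a real skeleton would define its own.
LEG_JOINTS = ("left_knee", "right_knee", "left_ankle", "right_ankle")


def mean_joint_error(reference, rollout, joints):
    """Mean Euclidean distance over all frames and the given joints.

    reference, rollout: equal-length lists of frames; each frame maps a
    joint name to an (x, y, z) position in metres.
    """
    if len(reference) != len(rollout):
        raise ValueError("reference and rollout must have the same length")
    total, count = 0.0, 0
    for ref_frame, out_frame in zip(reference, rollout):
        for joint in joints:
            total += math.dist(ref_frame[joint], out_frame[joint])
            count += 1
    return total / count


def leg_degradation(reference, full_rollout, upper_only_rollout, joints=LEG_JOINTS):
    """Extra leg error incurred when only upper-body keypoints are commanded."""
    full = mean_joint_error(reference, full_rollout, joints)
    upper = mean_joint_error(reference, upper_only_rollout, joints)
    return upper - full
```

A value near zero would suggest the legs stay coordinated without leg keypoints; a large positive value would point to the degradation the test is looking for. Position error alone cannot capture every aspect of coordination, so this would be one signal among several.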

Figures

Figures reproduced from arXiv: 2606.29209 by Jiachen Li, Mingyu Ding, Shuning Li, Sikai Li.

Figure 1: AnyBody learns a unified whole-body humanoid controller driven by arbitrary subsets of …
Figure 2: Overview of AnyBody. (a) We first train a privileged teacher tracker on a large-scale …
Figure 3: Open-ended behavior generation from sparse keypoint commands. Red arrows denote manually specified keypoint trajectories. AnyBody follows diverse sparse commands while synthesizing physically plausible and coordinated whole-body motions, including single-keypoint control: directional locomotion, arm swings, arm raises, bending, squatting, and multi-keypoint/closed-loop control. These examples illustrate t…
Figure 4: Lightweight trajectory-creation interface. Beyond quantitative benchmarks, we further study the open-ended controllability and generative capability of AnyBody under manually specified keypoint trajectories. Intuitively, the partial keypoint tracker can robustly follow a wide range of simple sparse commands while synthesizing coordinated whole-body behaviors, as shown in …
Figure 5: Real-world experiment overview. More demos can be found in the Appendix. We further evaluate whether AnyBody transfers from simulation to real-world humanoid teleoperation. Our hardware setup consists of a Unitree G1 humanoid together with an Apple Vision Pro headset for sparse keypoint capture and teleoperation input. We study whether sparse keypoint commands can effectively drive the humanoid policy w…
Figure 6: Latent-space RL expands AnyBody's capability coverage beyond direct motion following. Left: when driven solely by sparse keypoint tracking objectives, the pretrained controller naively follows the commanded trajectory and may fail in task-constrained environments, for example by colliding with obstacles. Middle: latent-space RL adapts the motion prior …
Figure 7: Wrist-keypoint following on hardware. Using the wrist keypoints (green dots; red arrows show commanded direction), an operator commands the G1 to first raise its arms and then lower them.
Figure 8: Obstacle reach on hardware. The G1 tracks a right-wrist keypoint target (green) while navigating around a table (top) and a cardboard box requiring a deep-squat posture (bottom).
Figure 9: Torso-guided locomotion on hardware. Using only the torso keypoint (green dot), an operator commands the G1 to walk 3 m forward.
Abstract

We present AnyBody, a unified whole-body humanoid controller driven by an arbitrary subset of body keypoints chosen at deploy time. Prior physics-based trackers either rely on expensive full-body motion capture and error-prone trajectory retargeting, which bottleneck scalable data collection and policy learning, or decompose upper- and lower-body control into separate hierarchical representations, sacrificing the coordinated whole-body motions that loco-manipulation requires. We close this gap by learning a single latent motion representation that any keypoint subset can address. To achieve this, we first train a privileged teacher tracker on a large unstructured motion corpus and distill it online into a deterministic encoder-decoder student whose latent space is a unit sphere. We then train a transformer keypoint encoder that admits any subset of body keypoints through masked self-attention, aligning it to the privileged latent. Additionally, we treat the frozen decoder as a motor prior and specialize downstream tasks with a lightweight residual corrector in the latent space. We demonstrate the effectiveness of AnyBody by tracking large-scale human motions from arbitrary keypoint subsets, free-form control, flexibly teleoperating, and learning downstream behaviors including locomotion, in-air writing, and obstacle-reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript presents AnyBody, a unified whole-body humanoid controller driven by an arbitrary subset of body keypoints at deploy time. It trains a privileged teacher tracker on unstructured motion data, distills it online into a deterministic encoder-decoder student with unit-sphere latent space, trains a masked self-attention transformer keypoint encoder aligned to that latent, and uses the frozen decoder plus a lightweight residual corrector for downstream task specialization. Demonstrations include tracking large-scale human motions from arbitrary keypoints, free-form control, teleoperation, and learning behaviors such as locomotion, in-air writing, and obstacle reaching.

Significance. If the online distillation and keypoint-to-latent alignment succeed without degrading coordinated whole-body motion quality, the work would be significant for physics-based humanoid control. It directly addresses the scalability bottleneck of full-body mocap and the coordination loss of hierarchical upper/lower-body decompositions, enabling more flexible loco-manipulation policies from partial observations. The single latent representation addressable by arbitrary masked subsets is a coherent architectural choice that, if validated, advances deploy-time adaptability.

minor comments (1)
  1. The provided manuscript text consists only of the abstract; without access to sections, equations, training details, or quantitative results, it is not possible to assess whether the distillation and alignment steps empirically support the central claim of lossless addressability by arbitrary keypoint subsets.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of AnyBody and for noting its potential to address scalability and coordination issues in whole-body humanoid control. The recommendation of 'uncertain' is noted, but the report lists no specific major comments under the MAJOR COMMENTS section. We therefore have no individual points to rebut or revise at this stage and look forward to any additional detailed feedback.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description outline a standard privileged-teacher distillation pipeline: an external unstructured motion corpus trains the teacher, online distillation produces the student encoder-decoder with unit-sphere latent, a separate masked-self-attention transformer is trained to align arbitrary keypoint subsets to that latent, and a frozen decoder plus residual corrector handles downstream tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the text. The central claim (single latent addressable by arbitrary subsets) is presented as the outcome of this empirical alignment training rather than a mathematical identity or reduction to the inputs by construction. The pipeline is self-contained against external motion data and does not invoke uniqueness theorems or prior self-authored results as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements would be extracted from the full manuscript.


