
arxiv: 2602.09628 · v2 · pith:777KAG5N · submitted 2026-02-10 · 💻 cs.RO

TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior

Pith reviewed 2026-05-16 05:37 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid teleoperation · gated expert selection · motion prior · whole-body control · VAE · real-time robotics · dynamic motion tracking

The pith

A lightweight gating network selects among expert policies for precise whole-body humanoid teleoperation while a motion prior supplies missing future intent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TeleGate as a way to control humanoid robots in real time across varied motions without forcing all skills into one compromised policy. Instead of distilling experts, it keeps each specialized policy intact and uses a small gating network to pick the right one on the fly from current body states and motion references. A VAE module trained on past observations supplies implicit predictions of what comes next, which supports anticipatory actions such as jumping or standing up. The system is trained on only 2.5 hours of motion-capture data and is shown to deliver higher tracking accuracy and success rates than distilled baselines in both simulation and on a physical Unitree G1 robot.

Core claim

TeleGate preserves the full capability of domain-specific expert policies by training a lightweight gating network, which dynamically activates experts in real-time based on proprioceptive states and reference trajectories. To compensate for the absence of future reference trajectories in real-time teleoperation, a VAE-based motion prior module extracts implicit future motion intent from historical observations, enabling anticipatory control for motions requiring prediction such as jumping and standing up.

What carries the argument

Lightweight gating network that selects which expert policy to activate, paired with a VAE-based motion prior that infers future intent from past observations.
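
The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single linear gate followed by a softmax and a hard argmax, and the names `select_and_act`, `walk_expert`, `jump_expert`, `GATE_W`, and `GATE_B` are hypothetical. The paper does not state the gate architecture or how features are encoded.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_act(
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    experts: Sequence[Expert],
    proprio: Sequence[float],
    reference: Sequence[float],
) -> Tuple[int, Vector]:
    """Pick one expert with a linear gate and return (index, action)."""
    # Assumption: the gate sees proprioception concatenated with the reference.
    features = list(proprio) + list(reference)
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    # Only the chosen expert runs, so its policy is used unmodified.
    return idx, list(experts[idx](features))


# Toy stand-ins for trained expert policies (hypothetical).
def walk_expert(obs: Sequence[float]) -> Vector:
    return [0.1 * x for x in obs]


def jump_expert(obs: Sequence[float]) -> Vector:
    return [1.0 for _ in obs]


GATE_W = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical gate weights
GATE_B = [0.0, 0.0]
idx, action = select_and_act(
    GATE_W, GATE_B, [walk_expert, jump_expert],
    proprio=[0.2, 0.0], reference=[2.0],
)
```

With these toy weights the large reference value drives the second logit, so the jump stand-in is selected. Because the gate picks a single expert rather than blending outputs, each expert's behaviour is left intact, which is the property the paper contrasts with distillation.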

If this is right

  • High-precision real-time tracking holds for running, fall recovery, and jumping without the accuracy loss typical of single-policy distillation.
  • Only 2.5 hours of motion-capture data suffice for training that generalizes to both simulation and the physical Unitree G1 robot.
  • Success rate and tracking error both improve over baseline methods that merge experts into one policy.
  • The same gating-plus-prior structure supports deployment in unstructured environments where motions vary rapidly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gating idea could apply to other multi-skill robotic domains where merging policies degrades peak performance on any single skill.
  • If the motion prior generalizes beyond the training distribution, similar modules might reduce reliance on future reference data in other teleoperation or imitation settings.
  • Testing whether adding more experts further widens the motion range without increasing gate error would be a direct next measurement.

Load-bearing premise

The lightweight gating network can reliably pick the correct expert from proprioceptive states and references, and the VAE can accurately infer future motion intent from history alone.

What would settle it

Record the gating network's expert choices during a failed jump or fall-recovery trial; if the wrong expert is chosen more than half the time and performance collapses, the selection mechanism does not work as claimed.
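
A sketch of how the first half of that test could be scored from a logged trial. It assumes a per-frame annotation of which expert should have been active (`intended`), which is a hypothetical label the paper does not describe; the function names are illustrative. The check covers only the selection rate, so whether performance also collapses has to be judged separately.

```python
from typing import Sequence


def wrong_expert_rate(chosen: Sequence[int], intended: Sequence[int]) -> float:
    """Fraction of frames where the gate's choice differs from the annotation."""
    if len(chosen) != len(intended):
        raise ValueError("chosen and intended must have the same length")
    if not chosen:
        raise ValueError("the trial log is empty")
    wrong = sum(1 for c, i in zip(chosen, intended) if c != i)
    return wrong / len(chosen)


def selection_refuted(
    chosen: Sequence[int], intended: Sequence[int], threshold: float = 0.5
) -> bool:
    """True when the wrong expert is chosen more than `threshold` of the time."""
    return wrong_expert_rate(chosen, intended) > threshold


# Hypothetical log of a failed jump: expert 0 should have been active throughout.
failed_jump_chosen = [1, 1, 1, 0]
failed_jump_intended = [0, 0, 0, 0]
refuted = selection_refuted(failed_jump_chosen, failed_jump_intended)
```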

Figures

Figures reproduced from arXiv: 2602.09628 by Jie Li (1, 2), Bing Tang (2), Feng Wu (1) ((1) University of Science and Technology of China, (2) AnyWit Robotics Co., Ltd.).

Figure 1: Whole-body teleoperation of the Unitree G1 humanoid robot using …
Figure 2: Framework overview. Our method consists of three stages: (I) Data collection and preprocessing using inertial motion capture; (II) Expert policy …
Figure 3: Expert switching analysis during continuous motion. Top: Key frame …
Figure 5: More real-world teleoperation skills: (a) sitting; (b) walking; (c) …

Original abstract

Real-time whole-body teleoperation is a critical method for humanoid robots to perform complex tasks in unstructured environments. However, developing a unified controller that robustly supports diverse human motions remains a significant challenge. Existing methods typically distill multiple expert policies into a single general policy, which often inevitably leads to performance degradation, particularly on highly dynamic motions. This paper presents TeleGate, a unified whole-body teleoperation framework for humanoid robots that achieves high-precision tracking across various motions while avoiding the performance loss inherent in knowledge distillation. Our key idea is to preserve the full capability of domain-specific expert policies by training a lightweight gating network, which dynamically activates experts in real-time based on proprioceptive states and reference trajectories. Furthermore, to compensate for the absence of future reference trajectories in real-time teleoperation, we introduce a VAE-based motion prior module that extracts implicit future motion intent from historical observations, enabling anticipatory control for motions requiring prediction such as jumping and standing up. We conducted empirical evaluations in simulation and also deployed our technique on the Unitree G1 humanoid robot. Using only 2.5 hours of motion capture data for training, our TeleGate achieves high-precision real-time teleoperation across diverse dynamic motions (e.g., running, fall recovery, and jumping), significantly outperforming the baseline methods in both tracking accuracy and success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TeleGate, a whole-body teleoperation system for humanoid robots that preserves specialized expert policies via a lightweight gating network selecting experts in real time from proprioceptive states and reference trajectories, augmented by a VAE-based motion prior that infers future intent from historical observations to support anticipatory control. It claims high-precision tracking on dynamic motions (running, jumping, fall recovery) in simulation and on the Unitree G1, trained from 2.5 hours of mocap data, with significant gains over distillation baselines in accuracy and success rate.

Significance. If the empirical advantages are confirmed with detailed metrics and controls, the gated-expert approach could meaningfully reduce the performance trade-offs typical of single-policy distillation for agile humanoid control, offering a practical route to robust real-time teleoperation with modest data requirements. The explicit separation of expert training from gating and the addition of a learned motion prior are technically coherent contributions.

major comments (3)
  1. [Abstract] Abstract: the central claim of 'significantly outperforming' baselines in tracking accuracy and success rate is stated without any numerical values, baseline implementations, statistical tests, or error bars, leaving the magnitude and reliability of the reported advantage impossible to assess from the provided text.
  2. [Evaluation] Evaluation section (implied by abstract): no quantitative results are supplied on gating-network accuracy, switching frequency, chattering, or latency under real-time constraints and sensor noise on the Unitree G1; without these, the skeptic concern that rapid state transitions (jumping, fall recovery) could cause incorrect expert activation remains unaddressed and load-bearing for the stability claim.
  3. [Method] Method (VAE motion prior): the abstract asserts that the VAE extracts 'implicit future motion intent' enabling anticipatory control, yet no prediction-error metrics, ablation on history length, or comparison against simpler predictors are reported, so the necessity and effectiveness of this module for the claimed dynamic motions cannot be verified.
minor comments (2)
  1. [Method] Clarify the precise conditioning inputs to the gating network (proprioception vs. reference trajectory encoding) and whether any smoothing or hysteresis is applied to prevent chattering.
  2. [Experiments] The training data volume (2.5 h) is stated but the split between expert-policy training and gating/VAE training is not detailed; add this breakdown.
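
The chattering concern and the hysteresis question raised in the comments above can be made concrete with a small sketch. It assumes the gate emits per-step probabilities and that a switch is accepted only when the new expert leads the current one by a fixed margin. The margin value, the function names, and the example probabilities are hypothetical, and the paper does not say whether any such smoothing is used.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def count_switches(selection: Sequence[int]) -> int:
    """Number of steps at which the active expert changes."""
    return sum(1 for a, b in zip(selection, selection[1:]) if a != b)


def apply_hysteresis(
    probs_per_step: Sequence[Sequence[float]], margin: float = 0.2
) -> List[int]:
    """Switch experts only when the challenger leads the current one by > margin."""
    if not probs_per_step:
        return []
    current = argmax(probs_per_step[0])
    selection = [current]
    for probs in probs_per_step[1:]:
        best = argmax(probs)
        if best != current and probs[best] - probs[current] > margin:
            current = best
        selection.append(current)
    return selection


# Hypothetical gate outputs over five control steps for two experts.
gate_probs = [[0.6, 0.4], [0.45, 0.55], [0.55, 0.45], [0.1, 0.9], [0.2, 0.8]]
raw = [argmax(p) for p in gate_probs]      # per-step argmax, no smoothing
smoothed = apply_hysteresis(gate_probs)    # margin-based switching
```

Comparing `count_switches(raw)` with `count_switches(smoothed)` gives one simple switching-frequency statistic. The trade-off is that a larger margin suppresses chatter but delays legitimate switches, which matters for fast transitions such as jumping.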

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing to strengthen the manuscript with additional quantitative details and analyses where appropriate.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'significantly outperforming' baselines in tracking accuracy and success rate is stated without any numerical values, baseline implementations, statistical tests, or error bars, leaving the magnitude and reliability of the reported advantage impossible to assess from the provided text.

    Authors: We agree that the abstract would benefit from explicit numerical support for the performance claims. In the revised version, we will update the abstract to report key quantitative results, including average joint position RMSE, velocity tracking errors, and success rates for dynamic motions (running, jumping, fall recovery), with direct comparisons to the distillation baselines. These values will be drawn from the full evaluation tables and will reference the specific experimental conditions.
    Revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by abstract): no quantitative results are supplied on gating-network accuracy, switching frequency, chattering, or latency under real-time constraints and sensor noise on the Unitree G1; without these, the skeptic concern that rapid state transitions (jumping, fall recovery) could cause incorrect expert activation remains unaddressed and load-bearing for the stability claim.

    Authors: We acknowledge that these gating-specific metrics are important for addressing real-time stability concerns. We will add a dedicated subsection in the Experiments section that reports gating-network accuracy (expert selection precision), switching frequency, chattering statistics, and end-to-end latency measurements. This analysis will include results from the Unitree G1 hardware under realistic sensor noise, with focused case studies on rapid transitions such as jumping and fall recovery to directly mitigate the concern about incorrect expert activation.
    Revision: yes

  3. Referee: [Method] Method (VAE motion prior): the abstract asserts that the VAE extracts 'implicit future motion intent' enabling anticipatory control, yet no prediction-error metrics, ablation on history length, or comparison against simpler predictors are reported, so the necessity and effectiveness of this module for the claimed dynamic motions cannot be verified.

    Authors: We agree that explicit validation of the VAE motion prior is needed. In the revision, we will add quantitative prediction-error metrics (e.g., future pose MSE over multiple horizons), an ablation study varying history length, and comparisons against simpler baselines such as constant-velocity extrapolation and LSTM predictors. These results will be presented in the Method and Experiments sections to demonstrate the VAE's contribution to anticipatory control for dynamic motions.
    Revision: yes
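
The constant-velocity baseline and the multi-horizon MSE named in the third response can be sketched as below. This is an illustration under stated assumptions, not the evaluation the authors will run: poses are plain lists of floats, velocity is the difference of the last two frames, and the function names and the toy trajectory are hypothetical.

```python
from typing import Dict, List, Sequence

Pose = Sequence[float]


def constant_velocity_predict(history: Sequence[Pose], horizon: int) -> List[float]:
    """Extrapolate `horizon` steps ahead from the last two poses."""
    if len(history) < 2:
        raise ValueError("need at least two poses to estimate velocity")
    last, prev = history[-1], history[-2]
    return [l + horizon * (l - p) for l, p in zip(last, prev)]


def mse(pred: Pose, target: Pose) -> float:
    """Mean squared error between two poses of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def horizon_errors(
    trajectory: Sequence[Pose], history_len: int, horizons: Sequence[int]
) -> Dict[int, float]:
    """Average prediction MSE of the baseline at each horizon."""
    errors: Dict[int, float] = {}
    for h in horizons:
        per_frame = []
        for t in range(history_len - 1, len(trajectory) - h):
            history = trajectory[t - history_len + 1 : t + 1]
            pred = constant_velocity_predict(history, h)
            per_frame.append(mse(pred, trajectory[t + h]))
        if per_frame:  # skip horizons longer than the data allows
            errors[h] = sum(per_frame) / len(per_frame)
    return errors


# Toy 1-D trajectory with acceleration, where constant velocity falls behind.
accelerating = [[0.0], [1.0], [4.0], [9.0]]
baseline = horizon_errors(accelerating, history_len=2, horizons=[1, 2])
```

A learned prior would be scored with the same `horizon_errors` loop by swapping in its predictor. Error that grows with horizon on accelerating motion is the gap a motion prior would need to close to justify its added complexity.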

Circularity Check

0 steps flagged

No circularity; claims rest on separate training data and external baseline comparisons

full rationale

The paper trains a lightweight gating network and VAE motion prior on 2.5 hours of motion-capture data, then evaluates tracking accuracy and success rate on held-out dynamic motions (running, jumping, fall recovery) plus real-robot deployment on Unitree G1. No equations or steps reduce by construction to their own inputs; the gating selection and anticipatory control are learned modules whose outputs are measured against independent baselines rather than defined to match them. No self-citations are load-bearing, no uniqueness theorems are imported from the authors' prior work, and no fitted parameters are relabeled as predictions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that separate expert policies exist for different motion domains and that a learned gating network plus VAE prior can be trained effectively from limited mocap data to generalize to real-time operation.

free parameters (2)
  • gating network weights
    Learned parameters of the lightweight gating network that selects experts based on proprioception and reference trajectories.
  • VAE latent parameters
    Parameters of the variational autoencoder that encodes historical observations into future motion intent predictions.
axioms (2)
  • domain assumption: Expert policies can be trained independently for distinct motion domains without interference
    Invoked when the framework preserves full capability of domain-specific experts rather than distilling them.
  • domain assumption: Historical proprioceptive observations contain sufficient information to infer future motion intent for dynamic actions
    Required for the VAE module to enable anticipatory control in the absence of future reference trajectories.
invented entities (1)
  • VAE-based motion prior module (no independent evidence)
    purpose: Extracts implicit future motion intent from historical observations to support anticipatory control
    New module introduced specifically to address the real-time teleoperation constraint of missing future trajectories.
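
For orientation, a generic conditional VAE objective of the kind such a module might optimise is written out below. This is the standard formulation from the VAE literature under assumed notation, not the loss reported in the paper: $h_t$ (the observation history), $x_{t+1:t+k}$ (the future motion to be reconstructed), the horizon $k$, and the weight $\beta$ are all hypothetical symbols.

```latex
% Encoder q_\phi maps history to a latent z; decoder p_\theta reconstructs future motion.
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid h_t)}\!\left[ -\log p_\theta\!\left(x_{t+1:t+k} \mid z, h_t\right) \right]
  + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid h_t) \,\Vert\, \mathcal{N}(0, I) \right)

% Reparameterised sample and closed-form KL for a diagonal Gaussian encoder:
z = \mu_\phi(h_t) + \sigma_\phi(h_t) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

D_{\mathrm{KL}} = \tfrac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)
```

Under this reading, the latent $z$ is the "implicit future motion intent" passed to the controller at run time, when only $h_t$ is available.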

pith-pipeline@v0.9.0 · 5567 in / 1628 out tokens · 37103 ms · 2026-05-16T05:37:28.262882+00:00 · methodology



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  1. [1]

    Karen Liu

    Jo ˜ao Pedro Ara ´ujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. InIEEE International Conference on Robotics and Automation (ICRA), 2026

  2. [2]

    Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit

    Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit. In Robotics: Science and Systems (RSS), 2025

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    A systematic review of commercial smart gloves: Current status and applications.Sensors, 21(8):2667, 2021

    Manuel Caeiro-Rodr ´ıguez, Iv ´an Otero-Gonz ´alez, Fer- nando A Mikic-Fonte, and Mart ´ın Llamas-Nistal. A systematic review of commercial smart gloves: Current status and applications.Sensors, 21(8):2667, 2021

  5. [5]

    Learning smooth humanoid locomotion through lipschitz-constrained poli- cies

    Zixuan Chen, Xialin He, Yen-Jen Wang, Qiayuan Liao, Yanjie Ze, Zhongyu Li, S Shankar Sastry, Jiajun Wu, Koushil Sreenath, Saurabh Gupta, et al. Learning smooth humanoid locomotion through lipschitz-constrained poli- cies. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

  6. [6]

    Gmt: Gen- eral motion tracking for humanoid whole-body control

    Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

  7. [7]

    Expressive whole-body control for humanoid robots

    Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots. InRobotics: Science and Systems (RSS), 2024

  8. [8]

    Open-television: Teleoperation with immersive active visual feedback

    Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. InConference on Robot Learning (CoRL), 2024

  9. [9]

    icub3 avatar system: Enabling remote fully immersive embodiment of humanoid robots.Sci- ence Robotics, 9(86):eadh3834, 2024

    Stefano Dafarra, Ugo Pattacini, Giulio Romualdi, Lorenzo Rapetti, Riccardo Grieco, Kourosh Darvish, Gi- anluca Milani, Enrico Valli, Ines Sorrentino, Paolo Maria Viceconte, et al. icub3 avatar system: Enabling remote fully immersive embodiment of humanoid robots.Sci- ence Robotics, 9(86):eadh3834, 2024

  10. [10]

    Whole-body geometric retargeting for humanoid robots

    Kourosh Darvish, Yeshasvi Tirupachuri, Giulio Ro- mualdi, Lorenzo Rapetti, Diego Ferigo, Francisco Javier Andrade Chavez, and Daniele Pucci. Whole-body geometric retargeting for humanoid robots. InIEEE- RAS International Conference on Humanoid Robots (Hu- manoids), pages 679–686, 2019

  11. [11]

    Legibility and predictability of robot motion

    Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. Legibility and predictability of robot motion. InACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 301–308, 2013

  12. [12]

    Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots

    Pranay Dugar, Aayam Shrestha, Fangzhou Yu, Bart van Marum, and Alan Fern. Learning multi-modal whole- body control for real-world humanoid robots.arXiv preprint arXiv:2408.07295, 2024

  13. [13]

    Airexo: Low-cost exoskeletons for learning whole- arm manipulation in the wild

    Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Airexo: Low-cost exoskeletons for learning whole- arm manipulation in the wild. InIEEE International Conference on Robotics and Automation (ICRA), pages 15031–15038, 2024

  14. [14]

    Humanplus: Humanoid shadowing and imitation from humans

    Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. InConference on Robot Learning (CoRL), 2024

  15. [15]

    Mobile aloha: Learning bimanual mobile manipulation with low- cost whole-body teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low- cost whole-body teleoperation. InConference on Robot Learning (CoRL), 2024

  16. [16]

    Advancing humanoid locomotion: Mastering challenging terrains with denoising world model learning

    Xinyang Gu, Yen-Jen Wang, Xiang Zhu, Chengming Shi, Yanjiang Guo, Yichen Liu, and Jianyu Chen. Advancing humanoid locomotion: Mastering challenging terrains with denoising world model learning.arXiv preprint arXiv:2408.14472, 2024

  17. [17]

    Omnih2o: Universal and dexterous human- to-humanoid whole-body teleoperation and learning

    Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human- to-humanoid whole-body teleoperation and learning. In Conference on Robot Learning (CoRL), 2024

  18. [18]

    Learning human- to-humanoid real-time whole-body teleoperation

    Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human- to-humanoid real-time whole-body teleoperation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024. Oral Presentation

  19. [19]

    Hodgins, Linxi Fan, Yuke Zhu, Changliu Liu, and Guanya Shi

    Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbabu, Chaoyi Pan, Zeji Yi, Guannan Qu, Kris Ki- tani, Jessica K. Hodgins, Linxi Fan, Yuke Zhu, Changliu Liu, and Guanya Shi. Asap: Aligning simulation and real-world physics for learning agile humanoid whole- body skills. InRobotics: Science and S...

  20. [20]

    Humanup: Learning getting-up policies for real- world humanoid robots

    Xialin He, Runpei Dong, Zixuan Chen, and Saurabh Gupta. Humanup: Learning getting-up policies for real- world humanoid robots. InRobotics: Science and Sys- tems (RSS), 2025

  21. [21]

    Host: Learning humanoid standing-up control across diverse postures

    Tao Huang, Junli Ren, Huayi Wang, Zirui Wang, Qing- wei Ben, Muning Wen, Xiao Chen, Jianan Li, and Jiangmiao Pang. Host: Learning humanoid standing-up control across diverse postures. InRobotics: Science and Systems (RSS), 2025. Best Systems Paper Finalist

  22. [22]

    OPEN TEACH: A versatile teleoperation system for robotic manipulation,

    Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. Open teach: A versatile teleoperation system for robotic manipulation.arXiv preprint arXiv:2403.07870, 2024

  23. [23]

    Exbody2: Ad- vanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

    Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Ex- body2: Advanced expressive humanoid whole-body con- trol.arXiv preprint arXiv:2412.13196, 2024

  24. [24]

    Behavior robot suite: Stream- lining real-world whole-body manipulation for everyday household activities.arXiv preprint arXiv:2503.05652, 2025

    Yizhou Jiang, Ruihai Zhang, Josiah Wong, Chris Wang, Yanjie Ze, Hang Yin, Celso Gokmen, Shuran Song, Jiajun Wu, and Li Fei-Fei. Behavior robot suite: Stream- lining real-world whole-body manipulation for everyday household activities.arXiv preprint arXiv:2503.05652, 2025

  25. [25]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2014. ICLR 2014

  26. [26]

    Real-time imitation of human whole-body mo- tions by humanoids

    Johannes Koenemann, Felix Burget, and Maren Ben- newitz. Real-time imitation of human whole-body mo- tions by humanoids. InIEEE International Conference on Robotics and Automation (ICRA), pages 2806–2812, 2014

  27. [27]

    Amo: Adaptive mo- tion optimization for hyper-dexterous humanoid whole- body control

    Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, and Xiaolong Wang. Amo: Adaptive mo- tion optimization for hyper-dexterous humanoid whole- body control. InRobotics: Science and Systems (RSS), 2025

  28. [28]

    Okami: Teaching humanoid robots manipulation skills through single video imitation

    Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation. InConference on Robot Learning (CoRL),

  29. [29]

    Reinforcement learning for robust parameterized loco- motion control of bipedal robots

    Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for robust parameterized loco- motion control of bipedal robots. InIEEE International Conference on Robotics and Automation (ICRA), pages 2811–2817, 2021

  30. [30]

    Rein- forcement learning for versatile, dynamic, and robust bipedal locomotion control.The International Journal of Robotics Research, page 02783649241285161, 2024

    Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Rein- forcement learning for versatile, dynamic, and robust bipedal locomotion control.The International Journal of Robotics Research, page 02783649241285161, 2024

  31. [31]

    Berkeley humanoid: A research platform for learning-based con- trol

    Qiayuan Liao, Bike Zhang, Xuanyu Huang, Xiaoyu Huang, Zhongyu Li, and Koushil Sreenath. Berkeley humanoid: A research platform for learning-based con- trol. InIEEE International Conference on Robotics and Automation (ICRA), 2025

  32. [32]

    Learning visuotactile skills with two multifingered hands,

    Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, and Jitendra Malik. Learning visuotactile skills with two multifingered hands.arXiv preprint arXiv:2404.16823, 2024

  33. [33]

    A glove-based system for studying hand-object manipulation via joint pose and force sens- ing

    Hangxin Liu, Xu Xie, Matt Millar, Mark Edmonds, Feng Gao, Yixin Zhu, Veronica J Santos, Brandon Rothrock, and Song-Chun Zhu. A glove-based system for studying hand-object manipulation via joint pose and force sens- ing. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6617–6624, 2017

  34. [34]

    High- fidelity grasping in virtual reality using a glove-based system

    Hangxin Liu, Zhenliang Zhang, Xu Xie, Yixin Zhu, Yue Liu, Yongtian Wang, and Song-Chun Zhu. High- fidelity grasping in virtual reality using a glove-based system. InIEEE International Conference on Robotics and Automation (ICRA), pages 5180–5186, 2019

  35. [35]

    Learning humanoid locomotion with perceptive internal model, 2024

    Junfeng Long, Junli Ren, Moji Shi, Zirui Wang, Tao Huang, Ping Luo, and Jiangmiao Pang. Learning hu- manoid locomotion with perceptive internal model.arXiv preprint arXiv:2411.14386, 2024

  36. [36]

    Learning h-infinity locomotion control.arXiv preprint, 2024

    Junfeng Long, Wenhan Yu, Quanyi Li, Zirui Wang, Dahua Lin, and Jiangmiao Pang. Learning h-infinity locomotion control.arXiv preprint, 2024

  37. [37]

    Mobile-television: Predictive motion priors for humanoid whole-body control

    Chenhao Lu, Xuxin Cheng, Jialong Li, Shiqi Yang, Mazeyu Ji, Chengjing Yuan, Ge Yang, Sha Yi, and Xiao- long Wang. Mobile-television: Predictive motion priors for humanoid whole-body control. InIEEE International Conference on Robotics and Automation (ICRA), 2025

  38. [38]

    Perpetual humanoid control for real-time simulated avatars

    Zhengyi Luo, Jinkun Cao, Alexander W Winkler, Kris Kitani, and Weipeng Xu. Perpetual humanoid control for real-time simulated avatars. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10895– 10904, 2023

  39. [39]

    Univer- sal humanoid motion representations for physics-based control

    Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Win- kler, Jing Huang, Kris Kitani, and Weipeng Xu. Univer- sal humanoid motion representations for physics-based control. InInternational Conference on Learning Repre- sentations (ICLR), 2024. Spotlight

  40. [40]

    Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

    Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Casta ˜neda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. Sonic: Supersizing motion tracking for natu...

  41. [41]

    Amass: Archive of motion capture as surface shapes

    Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 5442–5451, 2019

  42. [42]

    Deepmimic: Example-guided deep rein- forcement learning of physics-based character skills

    Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep rein- forcement learning of physics-based character skills. In ACM Transactions on Graphics (TOG), volume 37, pages 1–14, 2018

  43. [43]

    Amp: Adversarial motion priors for stylized physics-based character control.ACM Transac- tions on Graphics (TOG), 40(4):1–20, 2021

    Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control.ACM Transac- tions on Graphics (TOG), 40(4):1–20, 2021. SIGGRAPH 2021

  44. [44]

    Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023

    Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023

  45. [45]

    Learning humanoid locomotion over challenging terrain

    Ilija Radosavovic, Sarthak Kamat, Trevor Darrell, and Jitendra Malik. Learning humanoid locomotion over challenging terrain.arXiv preprint arXiv:2410.03654, 2024

  46. [46]

    Real-world hu- manoid locomotion with reinforcement learning.Science Robotics, 9(89):eadi9579, 2024

    Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Real-world hu- manoid locomotion with reinforcement learning.Science Robotics, 9(89):eadi9579, 2024

  47. [47]

    Humanoid locomotion as next token prediction

    Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, and Jitendra Malik. Humanoid locomotion as next token prediction. 2024

  48. [48]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  49. [49]

    Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz B ¨acher. Vmp: Versatile motion priors for robustly tracking motion on physical charac- ters.Computer Graphics Forum (ACM SIGGRAPH / Eurographics Symposium on Computer Animation), 43 (8), 2024

  50. [50]

    Bimanual dexterity for complex tasks

    Kenneth Shaw, Yulong Li, Jiahui Yang, Mohan Kumar Srirama, Ray Liu, Haoyu Xiong, Russell Mendonca, and Deepak Pathak. Bimanual dexterity for complex tasks. In8th Annual Conference on Robot Learning, 2024

  51. [51]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017

  52. [52]

    Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube

    Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, 2022

  53. [53]

    Unified loco-manipulation controller for humanoid robots

    Wandong Sun, Luying Feng, Baoshi Cao, Yang Liu, Yaochu Jin, and Zongwu Xie. Unified loco-manipulation controller for humanoid robots. arXiv preprint arXiv:2507.06905, 2025

  54. [54]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  55. [55]

    Dexcap: Scalable and portable mocap data collection system for dexterous manipulation

    Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and C Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. arXiv preprint arXiv:2403.07788, 2024

  56. [56]

    From experts to a generalist: Toward general whole-body control for humanoid robots

    Yuxuan Wang, Ming Yang, Weishuai Zeng, Yu Zhang, Xinrun Xu, Haobin Jiang, Ziluo Ding, and Zongqing Lu. From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779, 2025

  57. [57]

    Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators

    Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. 2023

  58. [58]

    Hugwbc: A unified and general humanoid whole-body controller for versatile locomotion

    Yufei Xue, Wentao Dong, Minghuan Liu, Weinan Zhang, and Jiangmiao Pang. Hugwbc: A unified and general humanoid whole-body controller for versatile locomotion. In Robotics: Science and Systems (RSS), 2025

  59. [59]

    Ace: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation

    Shiqi Yang, Minghuan Liu, Yuzhe Qin, Runyu Ding, Jialong Li, Xuxin Cheng, Ruihan Yang, Sha Yi, and Xiaolong Wang. Ace: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation. In Conference on Robot Learning (CoRL), 2024

  60. [60]

    Generalizable humanoid manipulation with improved 3d diffusion policies

    Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Generalizable humanoid manipulation with improved 3d diffusion policies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

  61. [61]

    Twist: Teleoperated whole-body imitation system

    Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system. In Conference on Robot Learning (CoRL), 2025

  62. [62]

    Wococo: Learning whole-body humanoid control with sequential contacts

    Chong Zhang, Wenli Xiao, Tairan He, and Guanya Shi. Wococo: Learning whole-body humanoid control with sequential contacts. In Conference on Robot Learning (CoRL), 2024. Oral Presentation

  63. [63]

    Track any motions under any disturbances

    Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, Huaping Liu, He Wang, and Li Yi. Track any motions under any disturbances, 2025. URL https://arxiv.org/abs/2509.13833

  64. [64]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  65. [65]

    Embrace collisions: Humanoid shadowing for deployable contact-agnostics motions

    Ziwen Zhuang and Hang Zhao. Embrace collisions: Humanoid shadowing for deployable contact-agnostics motions. arXiv preprint arXiv:2502.01465, 2025

  66. [66]

    Humanoid parkour learning

    Ziwen Zhuang, Shenzhe Yao, and Hang Zhao. Humanoid parkour learning. In Conference on Robot Learning (CoRL), pages 1975–1991. PMLR, 2024

    APPENDIX A. Hyperparameters and Training Settings

  67. [67]

    PPO Hyperparameters: Proximal Policy Optimization (PPO) is adopted for policy gradient training of both expert policies and the gating network. The learning rate is set to $3 \times 10^{-4}$, with a clip range of $\epsilon_{\text{clip}} = 0.2$. The Generalized Advantage Estimation (GAE) parameter is $\lambda = 0.95$, and the discount factor is $\gamma = 0.97$. Each batch of data is updated 4 times, with ...
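The role of the GAE parameter, discount factor, and clip range above can be illustrated with a minimal NumPy sketch. It uses the standard textbook forms of GAE and the PPO clipped surrogate, not the authors' training code; the function names `compute_gae` and `ppo_clip_loss` are hypothetical, and only the three constants come from the text.

```python
import numpy as np

GAMMA = 0.97       # discount factor (from the text)
GAE_LAMBDA = 0.95  # GAE parameter (from the text)
CLIP_EPS = 0.2     # PPO clip range (from the text)


def compute_gae(rewards, values, last_value, dones,
                gamma=GAMMA, lam=GAE_LAMBDA):
    """Generalized Advantage Estimation for one rollout of length T.

    rewards, values, dones are arrays of shape (T,); last_value is the
    bootstrap estimate V(s_T) used after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=CLIP_EPS):
    """Negative clipped surrogate objective (a quantity to minimize)."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With a discount factor of 0.97, rewards more than a few dozen steps ahead contribute little to the advantage, which keeps the estimate focused on short-horizon tracking error.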

  68. [68]

    VAE, Curriculum Sampling, and Action Scaling: The motion prediction prior based on a Variational Autoencoder (VAE) is jointly trained with expert policies, with future trajectory reconstruction loss weight $\lambda_{\text{recon}} = 0.5$, KL divergence weight $\lambda_{\text{KL}} = 0.0005$, and latent dimension $d = 32$. The trajectory sampling weight is computed as $w_i = T_i \cdot \left(1 + \min(\gamma \cdot f_i, \beta)\right)$, ...
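As a rough illustration of how the two loss weights and the sampling weight fit together, here is a NumPy sketch. Several things in it are assumptions rather than details from the paper: the reconstruction term is taken to be mean squared error, the KL term is the closed form against a standard normal prior, the weights are normalized into probabilities, and the values passed for `gamma` and `beta` are placeholders because the excerpt does not state them. The definition of $f_i$ is also cut off in the excerpt, so it is treated simply as a nonnegative per-trajectory statistic.

```python
import numpy as np

LAMBDA_RECON = 0.5   # reconstruction loss weight (from the text)
LAMBDA_KL = 0.0005   # KL divergence weight (from the text)


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch."""
    per_sample = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return np.mean(per_sample)


def vae_prior_loss(pred_future, true_future, mu, log_var):
    """Weighted sum of reconstruction and KL terms.

    MSE as the reconstruction loss is an assumption of this sketch.
    """
    recon = np.mean((pred_future - true_future) ** 2)
    return LAMBDA_RECON * recon + LAMBDA_KL * kl_to_standard_normal(mu, log_var)


def sampling_weights(lengths, f, gamma, beta):
    """w_i = T_i * (1 + min(gamma * f_i, beta)), normalized to sum to 1.

    gamma and beta are caller-supplied; their values are not given in the
    excerpt. Normalization is an assumption of this sketch.
    """
    lengths = np.asarray(lengths, dtype=float)
    f = np.asarray(f, dtype=float)
    w = lengths * (1.0 + np.minimum(gamma * f, beta))
    return w / w.sum()
```

The `min(..., beta)` cap bounds how much any single trajectory can be upweighted, so a trajectory with a very large $f_i$ receives at most a factor of $1 + \beta$ over its length-based weight.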

  69. [69]

    Training Environment and Scale: All policies are trained in the MuJoCo physics simulator on NVIDIA RTX A6000 PRO GPUs and implemented on top of the mjlab framework. The number of parallel environments is 32768, with a maximum episode length of 500 steps. During the expert policy phase, four expert groups (Walk/Run, Dance/Fight, Fall/Getup, Jump) are tr...

  70. [70]

    VAE with Transformer-based Encoder/Decoder: The motion prediction prior adopts a Transformer-based VAE architecture: the encoder $E_\phi$ takes as input the historical reference trajectory $M_t^-$ (5 frames) and outputs latent distribution parameters $(\mu_t, \sigma_t)$; the decoder $D_\psi$ predicts the future window $\tilde{M}_t^+$ (3 frames) conditioned on $z_t$. The architecture is show...
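The data flow described above (5-frame history in, latent sample, 3-frame future out) can be sketched with the standard reparameterization trick. This is only a shape-level illustration: the linear maps below are placeholders standing in for the Transformer encoder and decoder, `FRAME_DIM` is a hypothetical per-frame feature size, and the function names are invented for this example. The latent dimension of 32 is the value given in the hyperparameter settings.

```python
import numpy as np

HISTORY_FRAMES = 5  # length of the history window M_t^- (from the text)
FUTURE_FRAMES = 3   # length of the predicted future window (from the text)
LATENT_DIM = 32     # latent dimension d (from the hyperparameter settings)
FRAME_DIM = 64      # hypothetical per-frame feature size

rng = np.random.default_rng(0)

# Placeholder linear maps; the paper uses Transformer layers instead.
W_mu = rng.standard_normal((HISTORY_FRAMES * FRAME_DIM, LATENT_DIM)) * 0.01
W_log_var = rng.standard_normal((HISTORY_FRAMES * FRAME_DIM, LATENT_DIM)) * 0.01
W_dec = rng.standard_normal((LATENT_DIM, FUTURE_FRAMES * FRAME_DIM)) * 0.01


def encode(history):
    """Map a (HISTORY_FRAMES, FRAME_DIM) window to (mu, sigma)."""
    flat = history.reshape(-1)
    mu = flat @ W_mu
    sigma = np.exp(0.5 * (flat @ W_log_var))  # exp keeps sigma positive
    return mu, sigma


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)


def decode(z):
    """Map a latent vector to a (FUTURE_FRAMES, FRAME_DIM) prediction."""
    return (z @ W_dec).reshape(FUTURE_FRAMES, FRAME_DIM)


history = rng.standard_normal((HISTORY_FRAMES, FRAME_DIM))
mu, sigma = encode(history)
z = reparameterize(mu, sigma, rng)
future = decode(z)
```

Writing the sample as `mu + sigma * eps` keeps the sampling step differentiable with respect to the encoder outputs, which is what allows the prior to be trained jointly with the policies by gradient descent.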

  71. [71]

    Expert Policy Network (Actor): The Actor takes as input $s_t = (o_t, m_t, z_t)$ and outputs an action $a_t \in \mathbb{R}^{29}$. It adopts a 5-layer MLP with hidden layer dimensions of (512, 512, 256, 256, 128) and ReLU activation. The output ...

    TABLE V: VAE / Transformer Architecture

    Component/Hyperparameter        Value
    Encoder Transformer Layers      3
    Attention Heads                 8
    Hidden Dimension (d_model)      256
    ...

  72. [72]

    Critic Network: The Critic takes as input privileged observations (e.g., true state and future reference trajectories) and outputs a scalar state value $V(s_t) \in \mathbb{R}$. The network architecture is the same as the Actor, adopting a 5-layer MLP with hidden layer dimensions of (512, 512, 256, 256, 128), ReLU activation, and a 1-dimensional output layer
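The shared actor and critic layout can be sketched as a plain NumPy MLP. The hidden sizes, ReLU activation, 29-dimensional action, and scalar value output come from the text. The input sizes `ACTOR_IN` and `CRITIC_IN`, the He-style initialization, and the linear (unactivated) output layer are assumptions made for this example.

```python
import numpy as np

HIDDEN_DIMS = (512, 512, 256, 256, 128)  # from the text
ACTION_DIM = 29                          # actor output size (from the text)
ACTOR_IN = 256    # hypothetical size of (o_t, m_t, z_t) concatenated
CRITIC_IN = 384   # hypothetical size of the privileged observation


def init_mlp(in_dim, out_dim, rng, hidden=HIDDEN_DIMS):
    """Create (weight, bias) pairs for hidden layers plus an output layer."""
    sizes = (in_dim, *hidden, out_dim)
    return [
        (rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(params, x):
    """ReLU on every hidden layer; the output layer is left linear."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b


rng = np.random.default_rng(0)
actor = init_mlp(ACTOR_IN, ACTION_DIM, rng)
critic = init_mlp(CRITIC_IN, 1, rng)

action = mlp_forward(actor, rng.standard_normal(ACTOR_IN))
value = mlp_forward(critic, rng.standard_normal(CRITIC_IN))
```

Because the critic only runs during training, it can consume privileged inputs such as future reference frames without affecting what the deployed actor needs at run time.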

  73. [73]

    Gating Network: The gating network $G_\theta : (o_t, m_t) \mapsto \mathbb{R}^K$ outputs scores for $K = 4$ experts, and the arg max over these scores gives the current expert index. The network adopts a 5-layer MLP with hidden layer dimensions of (512, 512, 256, 256, 128), ReLU activation, and outputs a 4-dimensional vector corresponding to four expert groups (Walk/Run, Dance/Fight, Fall/Getup, Jump). ...
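The hard selection step described above can be sketched in a few lines. The gate scores and expert policies here are stand-ins (the experts are dummy callables, not trained networks), the helper names are hypothetical, and the mapping of index 0 to 3 onto the group names assumes the order in which the text lists them.

```python
import numpy as np

# Index order follows the listing in the text (an assumption).
EXPERT_GROUPS = ("Walk/Run", "Dance/Fight", "Fall/Getup", "Jump")
ACTION_DIM = 29


def select_expert(gate_scores):
    """Return the index of the highest-scoring expert (hard arg max)."""
    return int(np.argmax(gate_scores))


def gated_action(gate_scores, experts, actor_input):
    """Run only the selected expert and return (index, action)."""
    k = select_expert(gate_scores)
    return k, experts[k](actor_input)


# Dummy experts: expert k returns a constant action filled with k.
experts = [lambda x, k=k: np.full(ACTION_DIM, float(k)) for k in range(4)]

scores = np.array([0.1, 2.0, -1.0, 0.5])  # stand-in for gating network output
k, action = gated_action(scores, experts, np.zeros(8))
selected_group = EXPERT_GROUPS[k]
```

Since only one expert runs per control step, the per-step inference cost is one gating pass plus one expert pass, regardless of how many experts exist.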