arxiv: 2602.09628 · v2 · pith:777KAG5N · submitted 2026-02-10 · 💻 cs.RO

TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior

Jie Li (1, 2), Bing Tang (2), Feng Wu (1) ((1) University of Science and Technology of China, (2) AnyWit Robotics Co., Ltd.)

Pith reviewed 2026-05-16 05:37 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid teleoperation · gated expert selection · motion prior · whole-body control · VAE · real-time robotics · dynamic motion tracking

The pith

A lightweight gating network selects among expert policies for precise whole-body humanoid teleoperation while a motion prior supplies missing future intent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TeleGate as a way to control humanoid robots in real time across varied motions without forcing all skills into one compromised policy. Instead of distilling experts, it keeps each specialized policy intact and uses a small gating network to pick the right one on the fly from current body states and motion references. A VAE module trained on past observations supplies implicit predictions of what comes next, which supports anticipatory actions such as jumping or standing up. The system is trained on only 2.5 hours of motion-capture data and is shown to deliver higher tracking accuracy and success rates than distilled baselines in both simulation and on a physical Unitree G1 robot.

Core claim

TeleGate preserves the full capability of domain-specific expert policies by training a lightweight gating network, which dynamically activates experts in real-time based on proprioceptive states and reference trajectories. To compensate for the absence of future reference trajectories in real-time teleoperation, a VAE-based motion prior module extracts implicit future motion intent from historical observations, enabling anticipatory control for motions requiring prediction such as jumping and standing up.
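As a rough illustration of the selection step described above, the sketch below scores each expert from the concatenated proprioceptive state and reference, then runs only the top-scoring expert. The linear gate, the hard argmax, the weight matrix `W`, and the two toy experts are assumptions made for this example; the paper's actual gate architecture and selection rule are not given in this summary.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def gate_scores(weights: Sequence[Sequence[float]], features: Sequence[float]) -> Vector:
    """One linear score per expert (hypothetical gate; a real one would be learned)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def select_expert(scores: Sequence[float]) -> int:
    """Hard selection: index of the highest-scoring expert."""
    return max(range(len(scores)), key=lambda i: scores[i])


def gated_action(
    weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    proprio: Sequence[float],
    reference: Sequence[float],
) -> Tuple[int, Vector]:
    """Run only the selected expert, leaving every expert policy unmodified."""
    features = list(proprio) + list(reference)
    idx = select_expert(gate_scores(weights, features))
    return idx, experts[idx](features)


# Toy usage: two stand-in experts and made-up gate weights (assumptions).
walk: Expert = lambda obs: [0.1 * v for v in obs]
jump: Expert = lambda obs: [1.0 for _ in obs]
W = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
chosen, action = gated_action(W, [walk, jump], proprio=[0.2, 0.0], reference=[0.9])
```

Because exactly one expert produces the action at each step, no expert's weights are blended or compressed, which is the property the core claim contrasts with distillation.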

What carries the argument

Lightweight gating network that selects which expert policy to activate, paired with a VAE-based motion prior that infers future intent from past observations.

If this is right

High-precision real-time tracking holds for running, fall recovery, and jumping without the accuracy loss typical of single-policy distillation.
Only 2.5 hours of motion-capture data suffice for training that generalizes to both simulation and the physical Unitree G1 robot.
Success rate and tracking error both improve over baseline methods that merge experts into one policy.
The same gating-plus-prior structure supports deployment in unstructured environments where motions vary rapidly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gating idea could apply to other multi-skill robotic domains where merging policies degrades peak performance on any single skill.
If the motion prior generalizes beyond the training distribution, similar modules might reduce reliance on future reference data in other teleoperation or imitation settings.
Testing whether adding more experts further widens the motion range without increasing gate error would be a direct next measurement.

Load-bearing premise

The lightweight gating network can reliably pick the correct expert from proprioceptive states and references, and the VAE can accurately infer future motion intent from history alone.

What would settle it

Record the gating network's expert choices during a failed jump or fall-recovery trial; if the wrong expert is chosen more than half the time and performance collapses, the selection mechanism does not work as claimed.
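One way to run that check is sketched below. It assumes a per-step log of the gate's choices and a hand-labelled list of which expert should have been active at each step; both logs, and the function names, are hypothetical and not part of the paper.

```python
from typing import Sequence


def misselection_rate(chosen: Sequence[int], intended: Sequence[int]) -> float:
    """Fraction of control steps where the gate picked a different expert than intended."""
    if len(chosen) != len(intended):
        raise ValueError("logs must have the same length")
    if not chosen:
        raise ValueError("logs must not be empty")
    wrong = sum(1 for c, i in zip(chosen, intended) if c != i)
    return wrong / len(chosen)


def selection_fails(chosen: Sequence[int], intended: Sequence[int], threshold: float = 0.5) -> bool:
    """True if the wrong expert was chosen more than `threshold` of the time."""
    return misselection_rate(chosen, intended) > threshold


# Toy log from a single failed trial (made-up values).
log_chosen = [0, 0, 1, 1, 0, 0]
log_intended = [0, 1, 1, 1, 1, 1]
rate = misselection_rate(log_chosen, log_intended)
```

A rate above the threshold on failed trials only counts against the mechanism if task performance also collapses, so the rate should be read alongside the tracking error for the same trial.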

Figures

Figures reproduced from arXiv: 2602.09628 by Jie Li, Bing Tang, and Feng Wu.

**Figure 2.** Framework overview. Our method consists of three stages: (I) Data collection and preprocessing using inertial motion capture; (II) Expert policy …

**Figure 3.** Expert switching analysis during continuous motion. Top: Key frame …

**Figure 5.** More real-world teleoperation skills: (a) sitting; (b) walking; (c) …

Original abstract

Real-time whole-body teleoperation is a critical method for humanoid robots to perform complex tasks in unstructured environments. However, developing a unified controller that robustly supports diverse human motions remains a significant challenge. Existing methods typically distill multiple expert policies into a single general policy, which often inevitably leads to performance degradation, particularly on highly dynamic motions. This paper presents TeleGate, a unified whole-body teleoperation framework for humanoid robots that achieves high-precision tracking across various motions while avoiding the performance loss inherent in knowledge distillation. Our key idea is to preserve the full capability of domain-specific expert policies by training a lightweight gating network, which dynamically activates experts in real-time based on proprioceptive states and reference trajectories. Furthermore, to compensate for the absence of future reference trajectories in real-time teleoperation, we introduce a VAE-based motion prior module that extracts implicit future motion intent from historical observations, enabling anticipatory control for motions requiring prediction such as jumping and standing up. We conducted empirical evaluations in simulation and also deployed our technique on the Unitree G1 humanoid robot. Using only 2.5 hours of motion capture data for training, our TeleGate achieves high-precision real-time teleoperation across diverse dynamic motions (e.g., running, fall recovery, and jumping), significantly outperforming the baseline methods in both tracking accuracy and success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TeleGate keeps full expert performance on dynamic humanoid motions by gating instead of distilling, plus a VAE prior for anticipation, but the abstract leaves the quantitative backing thin.

The new piece is the lightweight gating network that picks among expert policies at each step using proprioception and reference trajectories, paired with a VAE that pulls future intent from history to handle the missing lookahead in live teleop. This sidesteps the usual accuracy drop when you compress multiple experts into one policy, and they train the whole thing on only 2.5 hours of mocap data before testing on the Unitree G1. That combination is a straightforward practical step for whole-body control in unstructured settings. The real-robot deployment and the focus on motions like jumping and fall recovery are the parts that stand out as useful. The abstract claims clear wins in tracking accuracy and success rate over baselines, which aligns with the goal of preserving expert capability. The main soft spot is the lack of concrete numbers, baseline code details, switching-frequency stats, or noise-robustness checks. Without those, it is hard to judge whether the gate stays stable during rapid transitions or whether the VAE predictions hold up under hardware timing. The stress-test worry about chattering or mis-selection is reasonable given what is shown so far. Readers who build teleop systems for humanoids would get value from the architecture and the limited-data training story. The work is coherent on its own terms and addresses a real deployment gap, so it deserves a serious referee even if the current evidence needs tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TeleGate, a whole-body teleoperation system for humanoid robots that preserves specialized expert policies via a lightweight gating network selecting experts in real time from proprioceptive states and reference trajectories, augmented by a VAE-based motion prior that infers future intent from historical observations to support anticipatory control. It claims high-precision tracking on dynamic motions (running, jumping, fall recovery) in simulation and on the Unitree G1, trained from 2.5 hours of mocap data, with significant gains over distillation baselines in accuracy and success rate.

Significance. If the empirical advantages are confirmed with detailed metrics and controls, the gated-expert approach could meaningfully reduce the performance trade-offs typical of single-policy distillation for agile humanoid control, offering a practical route to robust real-time teleoperation with modest data requirements. The explicit separation of expert training from gating and the addition of a learned motion prior are technically coherent contributions.

major comments (3)

[Abstract] Abstract: the central claim of 'significantly outperforming' baselines in tracking accuracy and success rate is stated without any numerical values, baseline implementations, statistical tests, or error bars, leaving the magnitude and reliability of the reported advantage impossible to assess from the provided text.
[Evaluation] Evaluation section (implied by abstract): no quantitative results are supplied on gating-network accuracy, switching frequency, chattering, or latency under real-time constraints and sensor noise on the Unitree G1; without these, the skeptic concern that rapid state transitions (jumping, fall recovery) could cause incorrect expert activation remains unaddressed and load-bearing for the stability claim.
[Method] Method (VAE motion prior): the abstract asserts that the VAE extracts 'implicit future motion intent' enabling anticipatory control, yet no prediction-error metrics, ablation on history length, or comparison against simpler predictors are reported, so the necessity and effectiveness of this module for the claimed dynamic motions cannot be verified.

minor comments (2)

[Method] Clarify the precise conditioning inputs to the gating network (proprioception vs. reference trajectory encoding) and whether any smoothing or hysteresis is applied to prevent chattering.
[Experiments] The training data volume (2.5 h) is stated but the split between expert-policy training and gating/VAE training is not detailed; add this breakdown.
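The hysteresis question in the first minor comment can be made concrete with a small sketch. The filter below switches experts only when a challenger leads by a score margin and the current expert has been active for a minimum number of steps. It is not taken from the paper; `margin`, `min_dwell`, and the score sequence are hypothetical values chosen for illustration.

```python
from typing import List, Optional, Sequence


def hysteresis_select(
    score_seq: Sequence[Sequence[float]], margin: float = 0.2, min_dwell: int = 3
) -> List[int]:
    """Switch only if the challenger leads by `margin` and the current
    expert has already been active for at least `min_dwell` steps."""
    selected: List[int] = []
    current: Optional[int] = None
    dwell = 0
    for scores in score_seq:
        best = max(range(len(scores)), key=lambda i: scores[i])
        if current is None:
            current, dwell = best, 1
        elif (
            best != current
            and dwell >= min_dwell
            and scores[best] - scores[current] >= margin
        ):
            current, dwell = best, 1
        else:
            dwell += 1
        selected.append(current)
    return selected


def count_switches(selection: Sequence[int]) -> int:
    """Number of steps at which the active expert changes."""
    return sum(1 for a, b in zip(selection, selection[1:]) if a != b)


# Noisy gate scores for two experts (made-up values).
noisy = [[0.6, 0.4], [0.45, 0.55], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9], [0.1, 0.9]]
raw = [max(range(2), key=lambda i: s[i]) for s in noisy]
smooth = hysteresis_select(noisy, margin=0.2, min_dwell=3)
```

Comparing `count_switches(raw)` with `count_switches(smooth)` gives a simple chattering statistic; the trade-off is that a dwell constraint delays legitimate fast transitions such as the onset of a jump.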

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing to strengthen the manuscript with additional quantitative details and analyses where appropriate.

Referee: [Abstract] Abstract: the central claim of 'significantly outperforming' baselines in tracking accuracy and success rate is stated without any numerical values, baseline implementations, statistical tests, or error bars, leaving the magnitude and reliability of the reported advantage impossible to assess from the provided text.

Authors: We agree that the abstract would benefit from explicit numerical support for the performance claims. In the revised version, we will update the abstract to report key quantitative results, including average joint position RMSE, velocity tracking errors, and success rates for dynamic motions (running, jumping, fall recovery), with direct comparisons to the distillation baselines. These values will be drawn from the full evaluation tables and will reference the specific experimental conditions.

revision: yes

Referee: [Evaluation] Evaluation section (implied by abstract): no quantitative results are supplied on gating-network accuracy, switching frequency, chattering, or latency under real-time constraints and sensor noise on the Unitree G1; without these, the skeptic concern that rapid state transitions (jumping, fall recovery) could cause incorrect expert activation remains unaddressed and load-bearing for the stability claim.

Authors: We acknowledge that these gating-specific metrics are important for addressing real-time stability concerns. We will add a dedicated subsection in the Experiments section that reports gating-network accuracy (expert selection precision), switching frequency, chattering statistics, and end-to-end latency measurements. This analysis will include results from the Unitree G1 hardware under realistic sensor noise, with focused case studies on rapid transitions such as jumping and fall recovery to directly mitigate the concern about incorrect expert activation.

revision: yes

Referee: [Method] Method (VAE motion prior): the abstract asserts that the VAE extracts 'implicit future motion intent' enabling anticipatory control, yet no prediction-error metrics, ablation on history length, or comparison against simpler predictors are reported, so the necessity and effectiveness of this module for the claimed dynamic motions cannot be verified.

Authors: We agree that explicit validation of the VAE motion prior is needed. In the revision, we will add quantitative prediction-error metrics (e.g., future pose MSE over multiple horizons), an ablation study varying history length, and comparisons against simpler baselines such as constant-velocity extrapolation and LSTM predictors. These results will be presented in the Method and Experiments sections to demonstrate the VAE's contribution to anticipatory control for dynamic motions.

revision: yes
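The constant-velocity baseline named in the last response is simple enough to sketch, together with a per-horizon MSE. This is a minimal illustration under assumed conventions (poses as flat lists of joint values, one pose per control step); the function names and the toy trajectory are hypothetical, and nothing here reproduces the authors' evaluation code.

```python
from typing import Dict, List, Sequence

Pose = Sequence[float]


def constant_velocity_predict(history: Sequence[Pose], horizon: int) -> List[float]:
    """Extrapolate the last observed per-joint velocity `horizon` steps ahead."""
    if len(history) < 2:
        raise ValueError("need at least two poses to estimate a velocity")
    prev, last = history[-2], history[-1]
    return [l + horizon * (l - p) for p, l in zip(prev, last)]


def mse(pred: Pose, target: Pose) -> float:
    """Mean squared error over joints."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)


def baseline_mse_by_horizon(
    trajectory: Sequence[Pose], history_len: int, horizons: Sequence[int]
) -> Dict[int, float]:
    """Average baseline MSE at each horizon over all valid start times."""
    result: Dict[int, float] = {}
    for h in horizons:
        errors = []
        # The history covers steps t-history_len .. t-1; the target is h steps after t-1.
        for t in range(history_len, len(trajectory) - h + 1):
            history = trajectory[t - history_len:t]
            target = trajectory[t - 1 + h]
            errors.append(mse(constant_velocity_predict(history, h), target))
        if not errors:
            raise ValueError(f"trajectory too short for horizon {h}")
        result[h] = sum(errors) / len(errors)
    return result


# Toy single-joint trajectory with constant acceleration (q_t = t^2).
traj = [[0.0], [1.0], [4.0], [9.0], [16.0], [25.0]]
errs = baseline_mse_by_horizon(traj, history_len=2, horizons=[1, 2])
```

On accelerating motion the baseline error grows quickly with horizon, which is the regime (jumping, standing up) where a learned prior would need to show a clear margin.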

Circularity Check

0 steps flagged

No circularity; claims rest on separate training data and external baseline comparisons

The paper trains a lightweight gating network and VAE motion prior on 2.5 hours of motion-capture data, then evaluates tracking accuracy and success rate on held-out dynamic motions (running, jumping, fall recovery) plus real-robot deployment on Unitree G1. No equations or steps reduce by construction to their own inputs; the gating selection and anticipatory control are learned modules whose outputs are measured against independent baselines rather than defined to match them. No self-citations are load-bearing, no uniqueness theorems are imported from the authors' prior work, and no fitted parameters are relabeled as predictions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that separate expert policies exist for different motion domains and that a learned gating network plus VAE prior can be trained effectively from limited mocap data to generalize to real-time operation.

free parameters (2)

gating network weights
Learned parameters of the lightweight gating network that selects experts based on proprioception and reference trajectories.
VAE latent parameters
Parameters of the variational autoencoder that encodes historical observations into future motion intent predictions.

axioms (2)

[domain assumption] Expert policies can be trained independently for distinct motion domains without interference
Invoked when the framework preserves full capability of domain-specific experts rather than distilling them.
[domain assumption] Historical proprioceptive observations contain sufficient information to infer future motion intent for dynamic actions
Required for the VAE module to enable anticipatory control in the absence of future reference trajectories.

invented entities (1)

VAE-based motion prior module [no independent evidence]
purpose: Extracts implicit future motion intent from historical observations to support anticipatory control
New module introduced specifically to address the real-time teleoperation constraint of missing future trajectories.
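For orientation, the two standard ingredients of any VAE, the reparameterized latent sample and the KL term against a standard normal prior, are sketched below. This is the textbook formulation with a diagonal Gaussian posterior, not the paper's module: how the history is encoded, what `recon_error` measures (for instance, an error on predicted future motion), and the weight `beta` are all assumptions.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float], rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vae_loss(recon_error: float, mu: Sequence[float], log_var: Sequence[float], beta: float = 1.0) -> float:
    """Reconstruction (or prediction) error plus a weighted KL regularizer."""
    return recon_error + beta * kl_to_standard_normal(mu, log_var)


# Toy usage with a 2-D latent; the encoder outputs are made-up values.
rng = random.Random(0)
mu, log_var = [0.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
loss = vae_loss(recon_error=0.25, mu=mu, log_var=log_var)
```

In a motion-prior setting the latent `z` would be passed to the policy as a compact summary of recent history; whether that summary carries usable future intent is exactly what the second axiom above assumes.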

pith-pipeline@v0.9.0 · 5567 in / 1628 out tokens · 37103 ms · 2026-05-16T05:37:28.262882+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: lightweight gating network... dynamically activates experts... VAE-based motion prior module that extracts implicit future motion intent

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: expert policies... gated expert selection... VAE... anticipatory control

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

[1]

Karen Liu

Jo ˜ao Pedro Ara ´ujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. InIEEE International Conference on Robotics and Automation (ICRA), 2026

work page 2026
[2]

Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit. In Robotics: Science and Systems (RSS), 2025

work page 2025
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

A systematic review of commercial smart gloves: Current status and applications.Sensors, 21(8):2667, 2021

Manuel Caeiro-Rodr ´ıguez, Iv ´an Otero-Gonz ´alez, Fer- nando A Mikic-Fonte, and Mart ´ın Llamas-Nistal. A systematic review of commercial smart gloves: Current status and applications.Sensors, 21(8):2667, 2021

work page 2021
[5]

Learning smooth humanoid locomotion through lipschitz-constrained poli- cies

Zixuan Chen, Xialin He, Yen-Jen Wang, Qiayuan Liao, Yanjie Ze, Zhongyu Li, S Shankar Sastry, Jiajun Wu, Koushil Sreenath, Saurabh Gupta, et al. Learning smooth humanoid locomotion through lipschitz-constrained poli- cies. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

work page 2025
[6]

Gmt: Gen- eral motion tracking for humanoid whole-body control

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

work page arXiv 2025
[7]

Expressive whole-body control for humanoid robots

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots. InRobotics: Science and Systems (RSS), 2024

work page 2024
[8]

Open-television: Teleoperation with immersive active visual feedback

Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. InConference on Robot Learning (CoRL), 2024

work page 2024
[9]

icub3 avatar system: Enabling remote fully immersive embodiment of humanoid robots.Sci- ence Robotics, 9(86):eadh3834, 2024

Stefano Dafarra, Ugo Pattacini, Giulio Romualdi, Lorenzo Rapetti, Riccardo Grieco, Kourosh Darvish, Gi- anluca Milani, Enrico Valli, Ines Sorrentino, Paolo Maria Viceconte, et al. icub3 avatar system: Enabling remote fully immersive embodiment of humanoid robots.Sci- ence Robotics, 9(86):eadh3834, 2024

work page 2024
[10]

Whole-body geometric retargeting for humanoid robots

Kourosh Darvish, Yeshasvi Tirupachuri, Giulio Ro- mualdi, Lorenzo Rapetti, Diego Ferigo, Francisco Javier Andrade Chavez, and Daniele Pucci. Whole-body geometric retargeting for humanoid robots. InIEEE- RAS International Conference on Humanoid Robots (Hu- manoids), pages 679–686, 2019

work page 2019
[11]

Legibility and predictability of robot motion

Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. Legibility and predictability of robot motion. InACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 301–308, 2013

work page 2013
[12]

Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots

Pranay Dugar, Aayam Shrestha, Fangzhou Yu, Bart van Marum, and Alan Fern. Learning multi-modal whole- body control for real-world humanoid robots.arXiv preprint arXiv:2408.07295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Airexo: Low-cost exoskeletons for learning whole- arm manipulation in the wild

Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Airexo: Low-cost exoskeletons for learning whole- arm manipulation in the wild. InIEEE International Conference on Robotics and Automation (ICRA), pages 15031–15038, 2024

work page 2024
[14]

Humanplus: Humanoid shadowing and imitation from humans

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. InConference on Robot Learning (CoRL), 2024

work page 2024
[15]

Mobile aloha: Learning bimanual mobile manipulation with low- cost whole-body teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low- cost whole-body teleoperation. InConference on Robot Learning (CoRL), 2024

work page 2024
[16]

Advancing humanoid locomotion: Mastering challenging terrains with denoising world model learning

Xinyang Gu, Yen-Jen Wang, Xiang Zhu, Chengming Shi, Yanjiang Guo, Yichen Liu, and Jianyu Chen. Advancing humanoid locomotion: Mastering challenging terrains with denoising world model learning.arXiv preprint arXiv:2408.14472, 2024

work page arXiv 2024
[17]

Omnih2o: Universal and dexterous human- to-humanoid whole-body teleoperation and learning

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human- to-humanoid whole-body teleoperation and learning. In Conference on Robot Learning (CoRL), 2024

work page 2024
[18]

Learning human- to-humanoid real-time whole-body teleoperation

Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human- to-humanoid real-time whole-body teleoperation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024. Oral Presentation

work page 2024
[19]

Hodgins, Linxi Fan, Yuke Zhu, Changliu Liu, and Guanya Shi

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbabu, Chaoyi Pan, Zeji Yi, Guannan Qu, Kris Ki- tani, Jessica K. Hodgins, Linxi Fan, Yuke Zhu, Changliu Liu, and Guanya Shi. Asap: Aligning simulation and real-world physics for learning agile humanoid whole- body skills. InRobotics: Science and S...

work page 2025
[20]

Humanup: Learning getting-up policies for real- world humanoid robots

Xialin He, Runpei Dong, Zixuan Chen, and Saurabh Gupta. Humanup: Learning getting-up policies for real- world humanoid robots. InRobotics: Science and Sys- tems (RSS), 2025

work page 2025
[21]

Host: Learning humanoid standing-up control across diverse postures

Tao Huang, Junli Ren, Huayi Wang, Zirui Wang, Qing- wei Ben, Muning Wen, Xiao Chen, Jianan Li, and Jiangmiao Pang. Host: Learning humanoid standing-up control across diverse postures. InRobotics: Science and Systems (RSS), 2025. Best Systems Paper Finalist

work page 2025
[22]

OPEN TEACH: A versatile teleoperation system for robotic manipulation,

Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. Open teach: A versatile teleoperation system for robotic manipulation.arXiv preprint arXiv:2403.07870, 2024

work page arXiv 2024
[23]

Exbody2: Ad- vanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Ex- body2: Advanced expressive humanoid whole-body con- trol.arXiv preprint arXiv:2412.13196, 2024

work page arXiv 2024
[24]

Behavior robot suite: Stream- lining real-world whole-body manipulation for everyday household activities.arXiv preprint arXiv:2503.05652, 2025

Yizhou Jiang, Ruihai Zhang, Josiah Wong, Chris Wang, Yanjie Ze, Hang Yin, Celso Gokmen, Shuran Song, Jiajun Wu, and Li Fei-Fei. Behavior robot suite: Stream- lining real-world whole-body manipulation for everyday household activities.arXiv preprint arXiv:2503.05652, 2025

work page arXiv 2025
[25]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2014. ICLR 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[26]

Real-time imitation of human whole-body mo- tions by humanoids

Johannes Koenemann, Felix Burget, and Maren Ben- newitz. Real-time imitation of human whole-body mo- tions by humanoids. InIEEE International Conference on Robotics and Automation (ICRA), pages 2806–2812, 2014

work page 2014
[27]

Amo: Adaptive mo- tion optimization for hyper-dexterous humanoid whole- body control

Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, and Xiaolong Wang. Amo: Adaptive mo- tion optimization for hyper-dexterous humanoid whole- body control. InRobotics: Science and Systems (RSS), 2025

work page 2025
[28]

Okami: Teaching humanoid robots manipulation skills through single video imitation

Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation. InConference on Robot Learning (CoRL),

work page
[29]

Reinforcement learning for robust parameterized loco- motion control of bipedal robots

Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for robust parameterized loco- motion control of bipedal robots. InIEEE International Conference on Robotics and Automation (ICRA), pages 2811–2817, 2021

work page 2021
[30]

Rein- forcement learning for versatile, dynamic, and robust bipedal locomotion control.The International Journal of Robotics Research, page 02783649241285161, 2024

Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Rein- forcement learning for versatile, dynamic, and robust bipedal locomotion control.The International Journal of Robotics Research, page 02783649241285161, 2024

work page 2024
[31]

Berkeley humanoid: A research platform for learning-based con- trol

Qiayuan Liao, Bike Zhang, Xuanyu Huang, Xiaoyu Huang, Zhongyu Li, and Koushil Sreenath. Berkeley humanoid: A research platform for learning-based con- trol. InIEEE International Conference on Robotics and Automation (ICRA), 2025

work page 2025
[32]

Learning visuotactile skills with two multifingered hands,

Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, and Jitendra Malik. Learning visuotactile skills with two multifingered hands.arXiv preprint arXiv:2404.16823, 2024

work page arXiv 2024
[33]

A glove-based system for studying hand-object manipulation via joint pose and force sens- ing

Hangxin Liu, Xu Xie, Matt Millar, Mark Edmonds, Feng Gao, Yixin Zhu, Veronica J Santos, Brandon Rothrock, and Song-Chun Zhu. A glove-based system for studying hand-object manipulation via joint pose and force sens- ing. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6617–6624, 2017

work page 2017
[34]

High- fidelity grasping in virtual reality using a glove-based system

Hangxin Liu, Zhenliang Zhang, Xu Xie, Yixin Zhu, Yue Liu, Yongtian Wang, and Song-Chun Zhu. High- fidelity grasping in virtual reality using a glove-based system. InIEEE International Conference on Robotics and Automation (ICRA), pages 5180–5186, 2019

work page 2019
[35]

Learning humanoid locomotion with perceptive internal model, 2024

Junfeng Long, Junli Ren, Moji Shi, Zirui Wang, Tao Huang, Ping Luo, and Jiangmiao Pang. Learning hu- manoid locomotion with perceptive internal model.arXiv preprint arXiv:2411.14386, 2024

work page arXiv 2024
[36]

Learning h-infinity locomotion control.arXiv preprint, 2024

Junfeng Long, Wenhan Yu, Quanyi Li, Zirui Wang, Dahua Lin, and Jiangmiao Pang. Learning h-infinity locomotion control.arXiv preprint, 2024

work page 2024
[37]

Mobile-television: Predictive motion priors for humanoid whole-body control

Chenhao Lu, Xuxin Cheng, Jialong Li, Shiqi Yang, Mazeyu Ji, Chengjing Yuan, Ge Yang, Sha Yi, and Xiao- long Wang. Mobile-television: Predictive motion priors for humanoid whole-body control. InIEEE International Conference on Robotics and Automation (ICRA), 2025

work page 2025
[38]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Alexander W Winkler, Kris Kitani, and Weipeng Xu. Perpetual humanoid control for real-time simulated avatars. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10895– 10904, 2023

work page 2023
[39]

Univer- sal humanoid motion representations for physics-based control

Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Win- kler, Jing Huang, Kris Kitani, and Weipeng Xu. Univer- sal humanoid motion representations for physics-based control. InInternational Conference on Learning Repre- sentations (ICLR), 2024. Spotlight

work page 2024
[40]

Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Casta ˜neda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. Sonic: Supersizing motion tracking for natu...

work page arXiv 2025
[41]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 5442–5451, 2019

work page 2019
[42]

Deepmimic: Example-guided deep rein- forcement learning of physics-based character skills

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep rein- forcement learning of physics-based character skills. In ACM Transactions on Graphics (TOG), volume 37, pages 1–14, 2018

work page 2018
[43]

Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (TOG), 40(4):1–20, 2021

Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (TOG), 40(4):1–20, 2021. SIGGRAPH 2021

work page 2021
[44]

Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577, 2023

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577, 2023

work page arXiv 2023
[45]

Learning humanoid locomotion over challenging terrain

Ilija Radosavovic, Sarthak Kamat, Trevor Darrell, and Jitendra Malik. Learning humanoid locomotion over challenging terrain. arXiv preprint arXiv:2410.03654, 2024

work page arXiv 2024
[46]

Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024

Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024

work page 2024
[47]

Humanoid locomotion as next token prediction

Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, and Jitendra Malik. Humanoid locomotion as next token prediction. 2024

work page 2024
[48]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[49]

Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz Bächer. Vmp: Versatile motion priors for robustly tracking motion on physical characters. Computer Graphics Forum (ACM SIGGRAPH / Eurographics Symposium on Computer Animation), 43(8), 2024

work page 2024
[50]

Bimanual dexterity for complex tasks

Kenneth Shaw, Yulong Li, Jiahui Yang, Mohan Kumar Srirama, Ray Liu, Haoyu Xiong, Russell Mendonca, and Deepak Pathak. Bimanual dexterity for complex tasks. In 8th Annual Conference on Robot Learning, 2024

work page 2024
[51]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017

work page 2017
[52]

Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube

Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, 2022

work page 2022
[53]

Unified loco-manipulation controller for humanoid robots. arXiv preprint arXiv:2507.06905, 2025

Wandong Sun, Luying Feng, Baoshi Cao, Yang Liu, Yaochu Jin, and Zongwu Xie. Unified loco-manipulation controller for humanoid robots. arXiv preprint arXiv:2507.06905, 2025

work page arXiv 2025
[54]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

work page 2017
[55]

Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. arXiv preprint arXiv:2403.07788, 2024

Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and C Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. arXiv preprint arXiv:2403.07788, 2024

work page arXiv 2024
[56]

From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779, 2025

Yuxuan Wang, Ming Yang, Weishuai Zeng, Yu Zhang, Xinrun Xu, Haobin Jiang, Ziluo Ding, and Zongqing Lu. From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779, 2025

work page arXiv 2025
[57]

Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators

Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. 2023

work page 2023
[58]

Hugwbc: A unified and general humanoid whole-body controller for versatile locomotion

Yufei Xue, Wentao Dong, Minghuan Liu, Weinan Zhang, and Jiangmiao Pang. Hugwbc: A unified and general humanoid whole-body controller for versatile locomotion. In Robotics: Science and Systems (RSS), 2025

work page 2025
[59]

Ace: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation

Shiqi Yang, Minghuan Liu, Yuzhe Qin, Runyu Ding, Jialong Li, Xuxin Cheng, Ruihan Yang, Sha Yi, and Xiaolong Wang. Ace: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation. In Conference on Robot Learning (CoRL), 2024

work page 2024
[60]

Generalizable humanoid manipulation with improved 3d diffusion policies

Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Generalizable humanoid manipulation with improved 3d diffusion policies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

work page 2025
[61]

Twist: Teleoperated whole-body imitation system

Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system. In Conference on Robot Learning (CoRL), 2025

work page 2025
[62]

Wococo: Learning whole-body humanoid control with sequential contacts

Chong Zhang, Wenli Xiao, Tairan He, and Guanya Shi. Wococo: Learning whole-body humanoid control with sequential contacts. In Conference on Robot Learning (CoRL), 2024. Oral Presentation

work page 2024
[63]

Track any motions under any disturbances

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, Huaping Liu, He Wang, and Li Yi. Track any motions under any disturbances, 2025. URL https://arxiv.org/abs/2509.13833

work page arXiv 2025
[64]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Embrace collisions: Humanoid shadowing for deployable contact-agnostics motions. arXiv preprint arXiv:2502.01465, 2025

Ziwen Zhuang and Hang Zhao. Embrace collisions: Humanoid shadowing for deployable contact-agnostics motions. arXiv preprint arXiv:2502.01465, 2025

work page arXiv 2025
[66]

Humanoid parkour learning

Ziwen Zhuang, Shenzhe Yao, and Hang Zhao. Humanoid parkour learning. In Conference on Robot Learning (CoRL), pages 1975–1991. PMLR, 2024.

work page 2024

APPENDIX A. Hyperparameters and Training Settings
PPO Hyperparameters: Proximal Policy Optimization (PPO) is adopted for policy gradient training of both the expert policies and the gating network. The learning rate is set to $3\times10^{-4}$, with a clip range of $\epsilon_{\text{clip}} = 0.2$. The Generalized Advantage Estimation (GAE) parameter is $\lambda = 0.95$, and the discount factor is $\gamma = 0.97$. The policy is updated 4 times on each batch of data, with ...
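As a rough illustration of where the stated values enter PPO, the sketch below computes GAE advantages with $\gamma = 0.97$ and $\lambda = 0.95$ and evaluates the clipped surrogate with $\epsilon_{\text{clip}} = 0.2$. The function names and the single-rollout setting without episode resets are assumptions made for this example; it is not the authors' training code.

```python
def gae_advantages(rewards, values, last_value, gamma=0.97, lam=0.95):
    """Generalized Advantage Estimation over one rollout (assumes no resets)."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_objective(ratio, advantage, eps_clip=0.2):
    """Per-sample clipped surrogate from the PPO paper (a quantity to maximize)."""
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy rollout with two steps and a zero value estimate everywhere.
advantages = gae_advantages(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0)
```

With a positive advantage, the objective stops growing once the probability ratio exceeds $1 + \epsilon_{\text{clip}}$, which is what limits the size of each policy update.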

VAE, Curriculum Sampling, and Action Scaling: The motion prediction prior, based on a Variational Autoencoder (VAE), is jointly trained with the expert policies, with future trajectory reconstruction loss weight $\lambda_{\text{recon}} = 0.5$, KL divergence weight $\lambda_{\text{KL}} = 0.0005$, and latent dimension $d = 32$. The trajectory sampling weight is computed as $w_i = T_i \cdot \left(1 + \min(\gamma \cdot f_i, \beta)\right)$, ...
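The weighting described above can be sketched as follows. Several points are assumptions, because the visible text is truncated: the VAE terms are combined as a plain weighted sum, the sampling formula is read as $T_i \cdot (1 + \min(\gamma \cdot f_i, \beta))$, and the values of $\gamma$, $\beta$, $T_i$ and $f_i$ used below are hypothetical placeholders.

```python
import random

LAMBDA_RECON = 0.5    # reconstruction loss weight (from the text)
LAMBDA_KL = 0.0005    # KL divergence weight (from the text)


def vae_loss(recon_error, kl_divergence):
    """Weighted VAE objective; the plain weighted sum is an assumption."""
    return LAMBDA_RECON * recon_error + LAMBDA_KL * kl_divergence


def sampling_weight(T_i, f_i, gamma, beta):
    """w_i = T_i * (1 + min(gamma * f_i, beta)); parenthesization assumed."""
    return T_i * (1.0 + min(gamma * f_i, beta))


def sample_trajectory(weights, rng):
    """Draw one trajectory index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical (T_i, f_i) pairs and coefficients, chosen only for illustration.
weights = [sampling_weight(T, f, gamma=2.0, beta=3.0)
           for T, f in [(100, 0.0), (100, 0.5), (50, 5.0)]]
index = sample_trajectory(weights, random.Random(0))
```

Under this reading, the `min` caps the boost that any single trajectory can receive, so the third trajectory's weight is bounded by $\beta$ even though its $f_i$ is large.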

Training Environment and Scale: All policies are trained in the MuJoCo physics simulator with NVIDIA RTX A6000 PRO GPUs, and implemented based on the mjlab framework. The number of parallel environments is 32768, with a maximum episode length of 500 steps. During the expert policy phase, four expert groups (Walk/Run, Dance/Fight, Fall/Getup, Jump) are tr...

VAE with Transformer-based Encoder/Decoder: The motion prediction prior adopts a Transformer-based VAE architecture: the encoder $E_\phi$ takes as input the historical reference trajectory $M_t^-$ (5 frames) and outputs latent distribution parameters $(\mu_t, \sigma_t)$; the decoder $D_\psi$ predicts the future window $\tilde{M}_t^+$ (3 frames) conditioned on $z_t$. The architecture is shown in Table V. ...
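A minimal sketch of the data flow around the encoder and decoder, leaving the Transformer itself out: slicing the 5-frame history and 3-frame future windows, and sampling $z_t$ from $(\mu_t, \sigma_t)$ with the standard reparameterization trick. Whether the history window includes frame $t$, and the use of reparameterized sampling, are assumptions; the helper names are hypothetical.

```python
import random


def split_windows(trajectory, t, history=5, future=3):
    """Return (M_t^-, M_t^+): the last `history` frames up to and including t
    (an assumption) and the next `future` frames used as the decoder target."""
    if t - history + 1 < 0 or t + future >= len(trajectory):
        raise ValueError("not enough frames around t")
    past = trajectory[t - history + 1 : t + 1]
    target = trajectory[t + 1 : t + 1 + future]
    return past, target


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Frames are stand-in integers here; real frames would be reference-pose vectors.
past, target = split_windows(list(range(20)), t=10)
z = reparameterize([0.0] * 32, [1.0] * 32, random.Random(0))  # latent dimension d = 32
```

At deployment only `past` is available, so the latent $z_t$ is the sole carrier of information about the upcoming frames.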

Expert Policy Network (Actor): The Actor takes as input $s_t = (o_t, m_t, z_t)$ and outputs action $a_t \in \mathbb{R}^{29}$. It adopts a 5-layer MLP with hidden layer dimensions of (512, 512, 256, 256, 128) and ReLU activation. The output ...

TABLE V
VAE / TRANSFORMER ARCHITECTURE

Component/Hyperparameter        Value
Encoder Transformer Layers      3
Attention Heads                 8
Hidden Dimension (d_model)      256
...

Critic Network: The Critic takes as input privileged observations (e.g., true state and future reference trajectories) and outputs a scalar state value $V(s_t) \in \mathbb{R}$. The network architecture is the same as the Actor, adopting a 5-layer MLP with hidden layer dimensions of (512, 512, 256, 256, 128), ReLU activation, and a 1-dimensional output layer.
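The MLP layout shared by the Actor and the Critic can be sketched in plain Python as below. The input dimension, the weight initialization, and the absence of an activation on the output layer are assumptions, since the text does not specify them; only the hidden sizes and output sizes (29 for the Actor, 1 for the Critic) come from the description.

```python
import random

HIDDEN = (512, 512, 256, 256, 128)  # hidden layer sizes from the text


def layer_shapes(in_dim, out_dim, hidden=HIDDEN):
    """List (fan_in, fan_out) for every linear layer, output layer included."""
    dims = (in_dim, *hidden, out_dim)
    return list(zip(dims[:-1], dims[1:]))


def init_mlp(in_dim, out_dim, hidden=HIDDEN, seed=0):
    """Random Gaussian weights and zero biases (initialization is an assumption)."""
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in layer_shapes(in_dim, out_dim, hidden):
        scale = fan_in ** -0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, x):
    """ReLU after every hidden layer, linear output (assumed)."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


# The input size 64 is a hypothetical placeholder, not a value from the paper.
actor_shapes = layer_shapes(in_dim=64, out_dim=29)
critic_shapes = layer_shapes(in_dim=64, out_dim=1)
```

Only the final layer differs between the two networks; in practice the Critic's input would also be larger because it receives privileged observations.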

Gating Network: The gating network $G_\theta : (o_t, m_t) \mapsto \mathbb{R}^K$ outputs scores for $K = 4$ experts, and takes the $\arg\max$ to obtain the current expert index. The network adopts a 5-layer MLP with hidden layer dimensions of (512, 512, 256, 256, 128), ReLU activation, and outputs a 4-dimensional vector corresponding to four expert groups (Walk/Run, Dance/Fight, Fall/Getup, Jump). ...
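The hard selection step can be sketched as follows, assuming the gate has already produced its $K = 4$ scores. The expert callables and the tie-breaking rule (lowest index wins) are assumptions for illustration; the real experts would be the trained policy networks.

```python
EXPERT_GROUPS = ("Walk/Run", "Dance/Fight", "Fall/Getup", "Jump")


def select_expert(scores):
    """Return the index of the highest score (ties go to the lowest index)."""
    if len(scores) != len(EXPERT_GROUPS):
        raise ValueError("expected one score per expert group")
    return max(range(len(scores)), key=scores.__getitem__)


def gated_action(scores, experts, expert_input):
    """Run only the selected expert and return (expert index, its action)."""
    k = select_expert(scores)
    return k, experts[k](expert_input)


# Stand-in experts that just tag their input; real experts output 29-D actions.
experts = [lambda s, name=name: (name, s) for name in EXPERT_GROUPS]
k, action = gated_action([0.1, 2.0, -1.0, 0.5], experts, expert_input="s_t")
```

Because exactly one expert is evaluated per control step, each expert's behavior is used unchanged rather than blended with the others.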
