LIMMT: Less is More for Motion Tracking

Chenghuai Lin; Dairu Liu; He Wang; Jilong Wang; Li Yi; Wenyao Zhang; Xinqiang Yu; Xuchuan Chen; Yu Guan; Zekun Qi

arxiv: 2606.06953 · v1 · pith:JM3MTRHEnew · submitted 2026-06-05 · 💻 cs.RO

LIMMT: Less is More for Motion Tracking

Yu Guan , Zekun Qi , Chenghuai Lin , Xuchuan Chen , Dairu Liu , Wenyao Zhang , Jilong Wang , Xinqiang Yu

show 2 more authors

He Wang Li Yi

This is my paper

Pith reviewed 2026-06-27 22:11 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid motion trackingphysics-based controldata selectionmotion data qualityAMASS datasetpolicy optimizationmotion capture

0 comments

The pith

Training on under 3% of high-quality motion data outperforms the full AMASS dataset for humanoid tracking policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that high-quality motion data steers physics-based tracking policies toward better optimization trajectories from early in training. It defines data quality along three dimensions: physics feasibility, diversity, and complexity. Experiments demonstrate that subsets under 3% of the AMASS collection, selected by these criteria, produce superior tracking performance compared to training on the entire dataset. The same selection process is used to clean estimated motion capture data collected from the web.

Core claim

Motion data selected according to physics feasibility, diversity, and complexity allows small subsets to guide humanoid tracking policies to better optimization trajectories than the full AMASS dataset, establishing the first data-centric approach for this task.

What carries the argument

Three-dimensional quality metric that scores motion clips for physics feasibility, diversity, and complexity to select data yielding superior policy optimization trajectories.

If this is right

Policies trained on quality-selected subsets reach higher tracking accuracy with less data.
The selection method improves performance on both curated AMASS motions and noisy web-sourced motion capture data.
Early-stage optimization trajectories improve when low-quality clips are removed rather than retained.
Dataset size alone does not determine tracking performance when quality criteria are applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data selection of this form could lower the compute required to train effective humanoid controllers.
Similar quality filters might improve results in other imitation-learning settings beyond tracking.
Many large motion datasets may contain substantial portions that slow rather than help policy learning.

Load-bearing premise

The three quality dimensions correctly identify motion clips that produce superior optimization trajectories for tracking policies, and experiments isolate the effect of data selection from other training factors.

What would settle it

An experiment that trains identical policies on the full AMASS dataset with the same hyperparameters and compute budget and obtains equal or better tracking performance than the quality-selected 3% subset would falsify the central claim.

read the original abstract

We argue that high-quality motion data can steer tracking policies toward better optimization trajectories early in training. In this work, we introduce LIMMT (Less Is More for Motion Tracking). To our knowledge, this is the first data-centric study for physics-based humanoid motion tracking. We go beyond simply removing low-quality and erroneous clips, but define motion data quality through three dimensions: physics feasibility, diversity, and complexity. We show that even training with under 3% of AMASS yields better tracking performance than training with the full dataset. We further conduct data cleaning on the estimated web-sourced mocap data. Extensive experiments and analyses validate the effectiveness of our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A 3% curated subset of AMASS beats the full set on tracking, but the comparison needs explicit controls on training effort to hold up.

read the letter

The main thing here is that training on a filtered slice under 3% of AMASS, picked for physics feasibility, diversity, and complexity, gives better tracking performance than the whole dataset. They extend the same filtering to web mocap as well.

This is new as a data-centric study in physics-based humanoid tracking. Most work in the area pushes scale, so the counter-result on curation is a useful observation. The three quality dimensions turn an abstract idea into something you can apply, and the paper shows the effect on held-out tracking metrics.

The experiments support the headline claim at the level described. That part is straightforward and worth having in the literature.

The soft spot is the isolation of the effect. The subset run and the full run have to match on total gradient steps, epochs, and hyperparameter effort for the gap to be attributed to data quality rather than optimization differences. If the smaller set simply converged faster under the same schedule, or got extra tuning, the result weakens. The abstract does not confirm those controls were equalized, so that needs checking in the full text.

This is for people working on data pipelines and training for humanoid control. A reader who wants practical ways to reduce data volume without losing performance would find the curation method and the scaling observation relevant.

Send it to peer review. The claim is concrete enough to test, and the approach is simple to reproduce if the training details check out.

Referee Report

2 major / 1 minor

Summary. The paper introduces LIMMT, a data-centric framework for physics-based humanoid motion tracking. It defines motion data quality along three dimensions (physics feasibility, diversity, complexity) and claims that training tracking policies on a curated subset of under 3% of AMASS yields better performance than training on the full dataset. The work also applies data cleaning to estimated web-sourced mocap data and validates the approach through experiments and analyses.

Significance. If the central empirical result holds under properly controlled conditions, the finding would demonstrate that targeted data selection can outperform scale in motion tracking, with implications for more efficient training of humanoid policies. The approach is grounded in held-out tracking metrics rather than circular definitions, providing a falsifiable empirical basis.

major comments (2)

[Abstract] Abstract: The headline claim that training with under 3% of AMASS outperforms the full dataset is load-bearing for the contribution, yet the description provides no confirmation that total optimization effort (epochs, gradient steps per epoch, or learning-rate schedules) was equalized between the subset and full-dataset runs; without this, performance differences cannot be attributed solely to the three quality dimensions.
[Abstract] The three quality dimensions (physics feasibility, diversity, complexity) are presented as correctly identifying motions that produce superior optimization trajectories, but the manuscript does not report whether hyperparameter tuning or random-seed averaging was performed identically for both conditions, leaving open confounding factors in the subset-vs-full comparison.

minor comments (1)

[Abstract] The claim 'to our knowledge, this is the first data-centric study' should be supported by a brief literature review in the introduction rather than left as an abstract statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on ensuring fair experimental comparisons. We provide clarifications below and will update the manuscript to address these concerns.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claim that training with under 3% of AMASS outperforms the full dataset is load-bearing for the contribution, yet the description provides no confirmation that total optimization effort (epochs, gradient steps per epoch, or learning-rate schedules) was equalized between the subset and full-dataset runs; without this, performance differences cannot be attributed solely to the three quality dimensions.

Authors: The training protocol was identical for both the curated subset and the full AMASS dataset, including the same number of epochs, gradient steps per epoch, and learning-rate schedules. This ensures that performance differences can be attributed to the data quality dimensions. We will explicitly state this in the revised abstract and experimental setup section. revision: yes
Referee: [Abstract] The three quality dimensions (physics feasibility, diversity, complexity) are presented as correctly identifying motions that produce superior optimization trajectories, but the manuscript does not report whether hyperparameter tuning or random-seed averaging was performed identically for both conditions, leaving open confounding factors in the subset-vs-full comparison.

Authors: Hyperparameters were tuned using the same procedure for both conditions, and all reported results are averaged over the same set of random seeds. We will add this information to the manuscript to confirm the comparisons are controlled. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data-selection result grounded in held-out metrics

full rationale

The paper is an empirical study that curates a motion subset via three quality dimensions and reports superior tracking performance on held-out metrics when training on <3% of AMASS versus the full set. No equations, fitted parameters, or derivations are present that reduce the reported gains to the quality definitions by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claim rests on experimental comparison rather than any of the enumerated circular patterns, making the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the quality dimensions are presented as definitional choices rather than derived quantities. Because only the abstract is available, the ledger remains empty pending full text.

pith-pipeline@v0.9.1-grok · 5663 in / 1149 out tokens · 18421 ms · 2026-06-27T22:11:54.073046+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 1 canonical work pages

[1]

Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025
[2]

Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

arXiv 2024
[3]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Mil- jan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Adv. Neural Inform. Process. Syst., 2017

2017
[4]

Synchronized human-humanoid motion imitation.IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

Antonin Dallard, Mehdi Benallegue, Fumio Kane- hiro, and Abderrahmane Kheddar. Synchronized human-humanoid motion imitation.IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

2023
[5]

Go to zero: Towards zero-shot motion generation with million-scale data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025

2025
[6]

Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wet- zstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

arXiv 2024
[7]

Robust motion in-betweening

Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020

2020
[8]

Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoper- ation and learning.arXiv preprint arXiv:2406.08858, 2024

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoper- ation and learning.arXiv preprint arXiv:2406.08858, 2024

arXiv 2024
[9]

Exbody2: Advanced expressive humanoid whole- body control.arXiv preprint arXiv:2412.13196, 2024

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Advanced expressive humanoid whole- body control.arXiv preprint arXiv:2412.13196, 2024

arXiv 2024
[10]

Switch-justdance: Benchmarking whole body motion tracking policies using a commercial console game.arXiv preprint arXiv:2511.17925, 2025

Jeonghwan Kim, Wontaek Kim, Yidan Lu, Jin Cheng, Fatemeh Zargarbashi, Zicheng Zeng, Zekun Qi, Zhiyang Dou, Nitish Sontakke, Donghoon Baek, et al. Switch-justdance: Benchmarking whole body motion tracking policies using a commercial console game.arXiv preprint arXiv:2511.17925, 2025

Pith/arXiv arXiv 2025
[11]

Phuma: Physically-grounded humanoid loco- motion dataset.arXiv preprint arXiv:2510.26236, 2025

Kyungmin Lee, Sibeen Kim, Minho Park, Hyunse- ung Kim, Dongyoon Hwang, Hojoon Lee, and Jaegul Choo. Phuma: Physically-grounded humanoid loco- motion dataset.arXiv preprint arXiv:2510.26236, 2025

Pith/arXiv arXiv 2025
[12]

Object motion guided human motion synthesis.ACM Trans- actions on Graphics (TOG), 42(6):1–11, 2023

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis.ACM Trans- actions on Graphics (TOG), 42(6):1–11, 2023

2023
[13]

Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks.arXiv preprint arXiv:2506.08931, 2025

Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks.arXiv preprint arXiv:2506.08931, 2025

arXiv 2025
[14]

Motion-x: A large-scale 3d expressive whole-body human motion dataset.Advances in Neural Infor- mation Processing Systems, 36:25268–25280, 2023

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset.Advances in Neural Infor- mation Processing Systems, 36:25268–25280, 2023

2023
[15]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023

2023
[16]

Universal humanoid motion representa- tions for physics-based control.arXiv preprint arXiv:2310.04582, 2023

Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representa- tions for physics-based control.arXiv preprint arXiv:2310.04582, 2023

arXiv 2023
[17]

Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chen- ran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025
[18]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 5442–5451, 2019

2019
[19]

Learning from massive human videos for universal humanoid pose control.arXiv preprint arXiv:2412.14172, 2024

Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, and Yue Wang. Learning from massive human videos for universal humanoid pose control.arXiv preprint arXiv:2412.14172, 2024

arXiv 2024
[20]

Deepmimic: Example-guided deep reinforcement learning of physics-based charac- ter skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based charac- ter skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

2018
[21]

Amp: Adversarial motion priors for stylized physics-based character control

Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

2021
[22]

Shapellm: Universal 3d object understand- ing for embodied interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understand- ing for embodied interaction. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLIII, volume 15101 ofLecture Notes in Com- puter Science, pages 2...

2024
[23]

Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation.CoRR, abs/2502.13143, 2025

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, and Li Yi. Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation.CoRR, abs/2502.13143, 2025. doi: 10.48550/ARXIV.2502.13...

work page doi:10.48550/arxiv.2502.13143 2025
[24]

Humanoid generative pre- training for zero-shot motion tracking

Zekun Qi, Xuchuan Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Wenyao Zhang, Xinqiang Yu, He Wang, and Li Yi. Humanoid generative pre- training for zero-shot motion tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20834–20844, 2026

2026
[25]

Anyteleop: A general vision- based dexterous robot arm-hand teleoperation sys- tem.arXiv preprint arXiv:2307.04577, 2023

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision- based dexterous robot arm-hand teleoperation sys- tem.arXiv preprint arXiv:2307.04577, 2023

arXiv 2023
[26]

Physcap: Physically plausi- ble monocular 3d motion capture in real time.ACM Transactions on Graphics (ToG), 39(6):1–16, 2020

Soshi Shimada, Vladislav Golyanik, Weipeng Xu, and Christian Theobalt. Physcap: Physically plausi- ble monocular 3d motion capture in real time.ACM Transactions on Graphics (ToG), 39(6):1–16, 2020

2020
[27]

Wham: Reconstructing world-grounded hu- mans with accurate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded hu- mans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080, 2024

2070
[28]

Deepphase: Periodic autoencoders for learning mo- tion phase manifolds.ACM Transactions on Graph- ics (ToG), 41(4):1–13, 2022

Sebastian Starke, Ian Mason, and Taku Komura. Deepphase: Periodic autoencoders for learning mo- tion phase manifolds.ACM Transactions on Graph- ics (ToG), 41(4):1–13, 2022

2022
[29]

Vla-jepa: Enhancing vision- language-action model with latent world model

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision- language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

arXiv 2026
[30]

Kungfubot: Physics-based humanoid whole-body control for learning highly- dynamic skills.arXiv preprint arXiv:2506.12851, 2025

Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, and Xuelong Li. Kungfubot: Physics-based humanoid whole-body control for learning highly- dynamic skills.arXiv preprint arXiv:2506.12851, 2025

arXiv 2025
[31]

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under kl- constraint

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under kl- constraint. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

2024
[32]

Collision-free humanoid traversal in cluttered indoor scenes.arXiv preprint arXiv:2601.16035, 2026

Han Xue, Sikai Liang, Zhikai Zhang, Zicheng Zeng, Yun Liu, Yunrui Lian, Jilong Wang, Qingtao Liu, Xuesong Shi, and Li Yi. Collision-free humanoid traversal in cluttered indoor scenes.arXiv preprint arXiv:2601.16035, 2026

arXiv 2026
[33]

Omnire- target: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C Karen Liu, Rocky Duan, and Guanya Shi. Omnire- target: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Pith/arXiv arXiv 2025
[34]

Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

arXiv 2025
[35]

Twist: Teleoperated whole-body imitation system

Yanjie Ze, Zixuan Chen, JoÃG, o Pedro AraÃšjo, Zi- ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025

arXiv 2025
[36]

Twist2: Scalable, portable, and holistic humanoid data collection sys- tem.arXiv preprint arXiv:2511.02832, 2025

Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection sys- tem.arXiv preprint arXiv:2511.02832, 2025

arXiv 2025
[37]

Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge

WenyaoZhang, HongsiLiu, ZekunQi, YunnanWang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge. CoRR, abs/2507.04447, 2025

Pith/arXiv arXiv 2025
[38]

Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

Wenyao Zhang, Bozhou Zhang, Zekun Qi, Wenjun Zeng, Xin Jin, and Li Zhang. Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

Pith/arXiv arXiv 2026
[39]

Motion-x++: A large-scale multimodal 3d whole-body human motion dataset.arXiv preprint arXiv:2501.05098, 2025

Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, ShunlinLu, YurongFu, YuanhaoCai, RuimaoZhang, Haoqian Wang, and Lei Zhang. Motion-x++: A large-scale multimodal 3d whole-body human motion dataset.arXiv preprint arXiv:2501.05098, 2025

arXiv 2025
[40]

Freemotion: Mocap-free human motion synthesis with multimodal large language models

Zhikai Zhang, Yitang Li, Haofeng Huang, Mingx- ian Lin, and Li Yi. Freemotion: Mocap-free human motion synthesis with multimodal large language models. InEuropean Conference on Computer Vi- sion, pages 403–421. Springer, 2024

2024
[41]

Unleashing humanoid reaching poten- tial via real-world-ready skill space.arXiv preprint arXiv:2505.10918, 2025

Zhikai Zhang, Chao Chen, Han Xue, Jilong Wang, Sikai Liang, Yun Liu, Zongzhang Zhang, He Wang, and Li Yi. Unleashing humanoid reaching poten- tial via real-world-ready skill space.arXiv preprint arXiv:2505.10918, 2025

arXiv 2025
[42]

Track any motions under any disturbances.arXiv preprint arXiv:2509.13833, 2025

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, et al. Track any motions under any disturbances.arXiv preprint arXiv:2509.13833, 2025

arXiv 2025
[43]

Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026

Zhikai Zhang, Haofei Lu, Yunrui Lian, Ziqing Chen, Yun Liu, Chenghuai Lin, Han Xue, Zicheng Zeng, Zekun Qi, Shaolin Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026. A Implementation Details A.1 Domain Randomization To improve sim-to-real transfer and policy robustness, we apply ...

arXiv 2026

[1] [1]

Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025

[2] [2]

Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

arXiv 2024

[3] [3]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Mil- jan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Adv. Neural Inform. Process. Syst., 2017

2017

[4] [4]

Synchronized human-humanoid motion imitation.IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

Antonin Dallard, Mehdi Benallegue, Fumio Kane- hiro, and Abderrahmane Kheddar. Synchronized human-humanoid motion imitation.IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

2023

[5] [5]

Go to zero: Towards zero-shot motion generation with million-scale data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025

2025

[6] [6]

Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wet- zstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

arXiv 2024

[7] [7]

Robust motion in-betweening

Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020

2020

[8] [8]

Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoper- ation and learning.arXiv preprint arXiv:2406.08858, 2024

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoper- ation and learning.arXiv preprint arXiv:2406.08858, 2024

arXiv 2024

[9] [9]

Exbody2: Advanced expressive humanoid whole- body control.arXiv preprint arXiv:2412.13196, 2024

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Advanced expressive humanoid whole- body control.arXiv preprint arXiv:2412.13196, 2024

arXiv 2024

[10] [10]

Switch-justdance: Benchmarking whole body motion tracking policies using a commercial console game.arXiv preprint arXiv:2511.17925, 2025

Jeonghwan Kim, Wontaek Kim, Yidan Lu, Jin Cheng, Fatemeh Zargarbashi, Zicheng Zeng, Zekun Qi, Zhiyang Dou, Nitish Sontakke, Donghoon Baek, et al. Switch-justdance: Benchmarking whole body motion tracking policies using a commercial console game.arXiv preprint arXiv:2511.17925, 2025

Pith/arXiv arXiv 2025

[11] [11]

Phuma: Physically-grounded humanoid loco- motion dataset.arXiv preprint arXiv:2510.26236, 2025

Kyungmin Lee, Sibeen Kim, Minho Park, Hyunse- ung Kim, Dongyoon Hwang, Hojoon Lee, and Jaegul Choo. Phuma: Physically-grounded humanoid loco- motion dataset.arXiv preprint arXiv:2510.26236, 2025

Pith/arXiv arXiv 2025

[12] [12]

Object motion guided human motion synthesis.ACM Trans- actions on Graphics (TOG), 42(6):1–11, 2023

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis.ACM Trans- actions on Graphics (TOG), 42(6):1–11, 2023

2023

[13] [13]

Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks.arXiv preprint arXiv:2506.08931, 2025

Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks.arXiv preprint arXiv:2506.08931, 2025

arXiv 2025

[14] [14]

Motion-x: A large-scale 3d expressive whole-body human motion dataset.Advances in Neural Infor- mation Processing Systems, 36:25268–25280, 2023

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset.Advances in Neural Infor- mation Processing Systems, 36:25268–25280, 2023

2023

[15] [15]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023

2023

[16] [16]

Universal humanoid motion representa- tions for physics-based control.arXiv preprint arXiv:2310.04582, 2023

Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representa- tions for physics-based control.arXiv preprint arXiv:2310.04582, 2023

arXiv 2023

[17] [17]

Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chen- ran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025

[18] [18]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 5442–5451, 2019

2019

[19] [19]

Learning from massive human videos for universal humanoid pose control.arXiv preprint arXiv:2412.14172, 2024

Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, and Yue Wang. Learning from massive human videos for universal humanoid pose control.arXiv preprint arXiv:2412.14172, 2024

arXiv 2024

[20] [20]

Deepmimic: Example-guided deep reinforcement learning of physics-based charac- ter skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based charac- ter skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

2018

[21] [21]

Amp: Adversarial motion priors for stylized physics-based character control

Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

2021

[22] [22]

Shapellm: Universal 3d object understand- ing for embodied interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understand- ing for embodied interaction. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLIII, volume 15101 ofLecture Notes in Com- puter Science, pages 2...

2024

[23] [23]

Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation.CoRR, abs/2502.13143, 2025

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, and Li Yi. Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation.CoRR, abs/2502.13143, 2025. doi: 10.48550/ARXIV.2502.13...

work page doi:10.48550/arxiv.2502.13143 2025

[24] [24]

Humanoid generative pre- training for zero-shot motion tracking

Zekun Qi, Xuchuan Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Wenyao Zhang, Xinqiang Yu, He Wang, and Li Yi. Humanoid generative pre- training for zero-shot motion tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20834–20844, 2026

2026

[25] [25]

Anyteleop: A general vision- based dexterous robot arm-hand teleoperation sys- tem.arXiv preprint arXiv:2307.04577, 2023

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision- based dexterous robot arm-hand teleoperation sys- tem.arXiv preprint arXiv:2307.04577, 2023

arXiv 2023

[26] [26]

Physcap: Physically plausi- ble monocular 3d motion capture in real time.ACM Transactions on Graphics (ToG), 39(6):1–16, 2020

Soshi Shimada, Vladislav Golyanik, Weipeng Xu, and Christian Theobalt. Physcap: Physically plausi- ble monocular 3d motion capture in real time.ACM Transactions on Graphics (ToG), 39(6):1–16, 2020

2020

[27] [27]

Wham: Reconstructing world-grounded hu- mans with accurate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded hu- mans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080, 2024

2070

[28] [28]

Deepphase: Periodic autoencoders for learning mo- tion phase manifolds.ACM Transactions on Graph- ics (ToG), 41(4):1–13, 2022

Sebastian Starke, Ian Mason, and Taku Komura. Deepphase: Periodic autoencoders for learning mo- tion phase manifolds.ACM Transactions on Graph- ics (ToG), 41(4):1–13, 2022

2022

[29] [29]

Vla-jepa: Enhancing vision- language-action model with latent world model

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision- language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

arXiv 2026

[30] [30]

Kungfubot: Physics-based humanoid whole-body control for learning highly- dynamic skills.arXiv preprint arXiv:2506.12851, 2025

Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, and Xuelong Li. Kungfubot: Physics-based humanoid whole-body control for learning highly- dynamic skills.arXiv preprint arXiv:2506.12851, 2025

arXiv 2025

[31] [31]

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under kl- constraint

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under kl- constraint. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

2024

[32] [32]

Collision-free humanoid traversal in cluttered indoor scenes.arXiv preprint arXiv:2601.16035, 2026

Han Xue, Sikai Liang, Zhikai Zhang, Zicheng Zeng, Yun Liu, Yunrui Lian, Jilong Wang, Qingtao Liu, Xuesong Shi, and Li Yi. Collision-free humanoid traversal in cluttered indoor scenes.arXiv preprint arXiv:2601.16035, 2026

arXiv 2026

[33] [33]

Omnire- target: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C Karen Liu, Rocky Duan, and Guanya Shi. Omnire- target: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Pith/arXiv arXiv 2025

[34] [34]

Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

arXiv 2025

[35] [35]

Twist: Teleoperated whole-body imitation system

Yanjie Ze, Zixuan Chen, JoÃG, o Pedro AraÃšjo, Zi- ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025

arXiv 2025

[36] [36]

Twist2: Scalable, portable, and holistic humanoid data collection sys- tem.arXiv preprint arXiv:2511.02832, 2025

Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection sys- tem.arXiv preprint arXiv:2511.02832, 2025

arXiv 2025

[37] [37]

Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge

WenyaoZhang, HongsiLiu, ZekunQi, YunnanWang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge. CoRR, abs/2507.04447, 2025

Pith/arXiv arXiv 2025

[38] [38]

Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

Wenyao Zhang, Bozhou Zhang, Zekun Qi, Wenjun Zeng, Xin Jin, and Li Zhang. Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

Pith/arXiv arXiv 2026

[39] [39]

Motion-x++: A large-scale multimodal 3d whole-body human motion dataset.arXiv preprint arXiv:2501.05098, 2025

Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, ShunlinLu, YurongFu, YuanhaoCai, RuimaoZhang, Haoqian Wang, and Lei Zhang. Motion-x++: A large-scale multimodal 3d whole-body human motion dataset.arXiv preprint arXiv:2501.05098, 2025

arXiv 2025

[40] [40]

Freemotion: Mocap-free human motion synthesis with multimodal large language models

Zhikai Zhang, Yitang Li, Haofeng Huang, Mingx- ian Lin, and Li Yi. Freemotion: Mocap-free human motion synthesis with multimodal large language models. InEuropean Conference on Computer Vi- sion, pages 403–421. Springer, 2024

2024

[41] [41]

Unleashing humanoid reaching poten- tial via real-world-ready skill space.arXiv preprint arXiv:2505.10918, 2025

Zhikai Zhang, Chao Chen, Han Xue, Jilong Wang, Sikai Liang, Yun Liu, Zongzhang Zhang, He Wang, and Li Yi. Unleashing humanoid reaching poten- tial via real-world-ready skill space.arXiv preprint arXiv:2505.10918, 2025

arXiv 2025

[42] [42]

Track any motions under any disturbances.arXiv preprint arXiv:2509.13833, 2025

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, et al. Track any motions under any disturbances.arXiv preprint arXiv:2509.13833, 2025

arXiv 2025

[43] [43]

Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026

Zhikai Zhang, Haofei Lu, Yunrui Lian, Ziqing Chen, Yun Liu, Chenghuai Lin, Han Xue, Zicheng Zeng, Zekun Qi, Shaolin Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026. A Implementation Details A.1 Domain Randomization To improve sim-to-real transfer and policy robustness, we apply ...

arXiv 2026