SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Chenran Li; Cyrus Hogg; David Minor; David Sami; Edy Lim; Eugene Jeong; Fernando Casta\~neda; Haoru Xue; Jan Kautz; Jiefeng Li

arxiv: 2511.07820 · v3 · pith:T3RKL6MLnew · submitted 2025-11-11 · 💻 cs.RO · cs.AI· cs.CV· cs.GR· cs.SY· eess.SY

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo , Ye Yuan , Tingwu Wang , Chenran Li , Fernando Casta\~neda , Sirui Chen , Zi-Ang Cao , Jiefeng Li

show 20 more authors

David Minor Qingwei Ben Jinhyung Park David Sami Zi Wang Xingye Da Runyu Ding Cyrus Hogg Lina Song Edy Lim Eugene Jeong Tairan He Haoru Xue Wenli Xiao Simon Yuen Jan Kautz Yan Chang Umar Iqbal Linxi "Jim" Fan Yuke Zhu

This is my paper

Pith reviewed 2026-05-22 12:22 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CVcs.GRcs.SYeess.SY

keywords humanoid controlmotion trackingscalingwhole-body controlmotion capturefoundation modelrobot learninglocomotion

0 comments

The pith

Scaling model size, data volume, and compute in motion tracking produces a generalist humanoid controller for natural whole-body movements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that increasing neural network capacity from 1.2M to 42M parameters, training on over 100 million frames of motion capture data, and using 21k GPU hours leads to better humanoid robot control. By framing motion tracking as the primary learning task, the approach extracts human motion patterns directly from data instead of relying on hand-designed rewards for each behavior. This produces controllers that handle diverse movements and extend to planning and language-driven tasks. A sympathetic reader would care because it offers a data-driven route to more versatile humanoid robots that move like people across many situations.

Core claim

Scaling along network size, dataset volume from 700 hours of motion capture, and compute creates a foundation model for motion tracking that delivers natural, robust whole-body humanoid control, improves steadily with more resources, generalizes to unseen motions, and supports downstream uses such as real-time kinematic planning for navigation and a unified interface for VR teleoperation plus vision-language-action models.

What carries the argument

Motion tracking treated as a scalable supervised task that supplies dense supervision from large motion-capture datasets to acquire general human motion priors.

If this is right

Tracking performance rises steadily as compute and data diversity grow.
Policies generalize to motions absent from the training set.
A real-time kinematic planner can convert tracking outputs into natural navigation and interactive whole-body behaviors.
A single policy handles both VR teleoperation and vision-language-action models via a shared token space.
The same controller supports coordinated hand and foot actions in autonomous loco-manipulation driven by vision-language inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This scaling route could reduce reliance on per-task reward engineering across many robot behaviors.
The learned motion priors might transfer to humanoids of different sizes or proportions if the underlying patterns prove body-agnostic.
Direct coupling to language models could let robots follow spoken instructions that combine locomotion with object handling.
Further increases in data and model size may close remaining gaps between simulated and real-world agility.

Load-bearing premise

That large-scale motion-capture data alone supplies enough signal to learn control policies that remain robust and generalizable on humanoid robots.

What would settle it

A head-to-head comparison showing that the largest scaled model performs no better than smaller versions when evaluated on a broad set of complex, previously unseen whole-body motion sequences.

Figures

Figures reproduced from arXiv: 2511.07820 by Chenran Li, Cyrus Hogg, David Minor, David Sami, Edy Lim, Eugene Jeong, Fernando Casta\~neda, Haoru Xue, Jan Kautz, Jiefeng Li, Jinhyung Park, Lina Song, Linxi "Jim" Fan, Qingwei Ben, Runyu Ding, Simon Yuen, Sirui Chen, Tairan He, Tingwu Wang, Umar Iqbal, Wenli Xiao, Xingye Da, Yan Chang, Ye Yuan, Yuke Zhu, Zhengyi Luo, Zi-Ang Cao, Zi Wang.

**Figure 1.** Figure 1: SONIC enables diverse humanoid tasks through a universal control policy that handles diverse input modalities and control interfaces. extensive reward engineering for each scenario – walking naturally forward provides little signal for dancing (He et al., 2025), getting up from the ground (He et al., 2025; Huang et al., 2025), or teleoperation (Ben et al., 2025; Li et al., 2025; Ze et al., 2025). Each new … view at source ↗

**Figure 2.** Figure 2: (a-c) Effect of scaling to different sizes of dataset, model, and compute. Mean per joint position error [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Top three rows: interactive navigation switching between different velocities, directions, and styles. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Interactive squatting, kneeling, and crawling. With [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Video teleoperation, multi-modal control, and VR whole-body teleoperation. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Apple-to-plate mobile bimanual manipulation on the Unitree G1 humanoid robot controlled by a fine [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Random samples from our motion dataset. innovation is its ability to seamlessly handle robot motion, human motion, and hybrid motion (combining upper-body keypoints with lower-body robot motions) through a shared latent representation. This crossembodiment capability enables the robot to learn from motion captured and raw video data, bridging the morphological gap between human and robot embodiments. We u… view at source ↗

**Figure 8.** Figure 8: SONIC enables universal humanoid motion tracking through a universal control policy that handles diverse motion commands and modalities. Specialized encoders process robot, human, and hybrid motion commands into a universal token that drives robot control and motion decoders. This cross-embodiment design supports diverse applications including gamepad control, VR teleoperation, whole-body teleoperation, vi… view at source ↗

read the original abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Scaling motion tracking to 42M parameters and 700 hours of mocap data produces a generalist humanoid controller with practical interfaces, though the size of the gains needs concrete numbers to judge.

read the letter

This paper shows that scaling model size, data, and compute for motion tracking can produce a generalist controller for humanoid whole-body control. They take a large mocap dataset of 700 hours and train models up to 42M parameters over 21k GPU hours. The results indicate steady performance gains and the ability to handle motions outside the training set. They also introduce a unified token space that connects the policy to VR teleoperation and VLA models, along with a kinematic planner for task-level control like navigation and loco-manipulation.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SONIC, a scaled foundation model for humanoid whole-body control obtained by training a motion-tracking policy on large-scale motion-capture data. It scales network size from 1.2M to 42M parameters, dataset volume to over 100M frames drawn from 700 hours of mocap, and compute to 21k GPU hours. The central claim is that this scaling produces a generalist controller capable of natural, robust whole-body movements that generalizes to unseen motions; downstream utility is shown via a real-time kinematic planner for navigation and a unified token space enabling VR teleoperation and VLA-driven loco-manipulation.

Significance. If the scaling results and generalization claims hold, the work would be significant for robotics by demonstrating that dense mocap supervision can serve as a scalable pre-training task for humanoid control, yielding policies that avoid manual reward engineering. The explicit three-axis scaling study, the reported steady performance gains, and the practical interfaces for downstream tasks (kinematic planner and unified token space) constitute concrete strengths that could influence future foundation-model efforts in humanoid robotics.

major comments (2)

[Abstract] Abstract: the statement that 'performance improves steadily with compute and data diversity' is presented without any quantitative metrics, baselines, error bars, or ablation tables. Because this scaling hypothesis is the load-bearing empirical claim, the absence of supporting numbers leaves the central result only partially substantiated.
[Abstract] Abstract and §4 (presumed results section): the claim that 'learned policies generalize to unseen motions' is stated without details on how the unseen test set was constructed, what tracking-error metrics were used, or comparisons against smaller-scale baselines. These omissions directly affect evaluation of the generalization property asserted in the abstract.

minor comments (1)

[Abstract] Abstract: the phrase 'unified token space' is introduced without a brief definition or pointer to the section that explains how motion, VR, and VLA tokens are represented in the same space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of scaling motion tracking for humanoid control. We address each major comment below, referencing the relevant sections of the manuscript and indicating revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that 'performance improves steadily with compute and data diversity' is presented without any quantitative metrics, baselines, error bars, or ablation tables. Because this scaling hypothesis is the load-bearing empirical claim, the absence of supporting numbers leaves the central result only partially substantiated.

Authors: We agree that the abstract would be strengthened by including representative quantitative results. The full manuscript substantiates the scaling hypothesis in Section 4 with ablation studies across model sizes (1.2M to 42M parameters), data volumes, and compute budgets. Figure 3 and Table 2 report steady reductions in mean per-joint position error (MPJPE) and velocity error with increasing scale, including error bars from three random seeds and direct comparisons to smaller baselines. We have revised the abstract to incorporate key metrics illustrating these trends while respecting length limits. revision: yes
Referee: [Abstract] Abstract and §4 (presumed results section): the claim that 'learned policies generalize to unseen motions' is stated without details on how the unseen test set was constructed, what tracking-error metrics were used, or comparisons against smaller-scale baselines. These omissions directly affect evaluation of the generalization property asserted in the abstract.

Authors: Section 4.3 of the manuscript details the unseen test set construction: it comprises held-out motion sequences from the AMASS dataset (different performers and activity categories) plus custom captures not used in training, totaling approximately 10% of the data. Tracking error is quantified via MPJPE and angular velocity error, as defined in Section 3.2. Figure 5 and the accompanying text provide direct comparisons showing that the 42M model reduces error on these unseen motions by 12-18% relative to the 1.2M baseline. To improve clarity, we have added a brief description of the test-set construction and primary metrics to the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper reports empirical scaling outcomes from training on external motion-capture datasets (100M+ frames, 700 hours) with explicit variation in model size (1.2M–42M parameters) and compute (21k GPU hours). Performance gains and generalization are measured directly against held-out motions and downstream tasks; no equations, fitted parameters, or self-citations are invoked as load-bearing derivations that reduce the claimed results to the inputs by construction. The argument is self-contained against observable benchmarks and does not rely on self-definitional loops or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that dense mocap supervision can replace manual reward engineering and that standard deep-learning scaling behaviors transfer to humanoid control; no explicit free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Dense supervision from diverse motion-capture data acquires human motion priors without manual reward engineering.
Stated directly in the abstract when positioning motion tracking as the scalable task.

pith-pipeline@v0.9.0 · 5906 in / 1337 out tokens · 67496 ms · 2026-05-22T12:22:15.335513+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours).
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Motion tracking leverages human motion capture data, which provides dense, frame-by-frame supervision without reward engineering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

CEER proposes a compliant end-effector and root control interface that unifies loco-manipulation for humanoids via a distilled low-level policy and hierarchical planners.
VOFA: Visual Object Goal Pushing with Force-Adaptive Control for Humanoids
cs.RO 2026-05 unverdicted novelty 6.0

VOFA combines a high-level visuomotor policy with a low-level force-adaptive controller to let humanoids push objects up to 17 kg to arbitrary goals using only noisy onboard vision, achieving over 80% real-world success.
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
cs.RO 2026-04 unverdicted novelty 6.0

ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
cs.RO 2026-04 unverdicted novelty 6.0

A diffusion-based motion generator combined with an RL motion tracker enables terrain-aware whole-body locomotion on a humanoid robot by adapting reference motions online from perception.
CLAW: Composable Language-Annotated Whole-body Motion Generation
cs.RO 2026-04 accept novelty 6.0

CLAW composes motion primitives from a kinematic planner, tracks them with a low-level controller in MuJoCo to produce physically grounded trajectories, and generates segment- and trajectory-level language annotations...
Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...
Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
cs.RO 2026-02 unverdicted novelty 6.0

A modular system uses motion matching to compose long-horizon human skill chains, trains RL experts, and distills them into a depth-based policy that lets a Unitree G1 humanoid autonomously climb, vault, and roll over...
HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model
cs.RO 2026-02 unverdicted novelty 6.0

HAIC enables robust humanoid interactions with underactuated objects by predicting their dynamics from proprioceptive history and using a world model for adaptive control.
TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior
cs.RO 2026-02 unverdicted novelty 6.0

TeleGate achieves high-precision real-time whole-body teleoperation of humanoid robots by dynamically gating between expert policies and using a VAE motion prior to infer future intent from history, outperforming dist...
HoloMotion-1 Technical Report
cs.RO 2026-05 unverdicted novelty 5.0

HoloMotion-1 trains a large Mixture-of-Experts Transformer policy on a hybrid corpus of video-reconstructed and MoCap motions to achieve robust zero-shot whole-body tracking that transfers directly to real humanoid robots.
HoloMotion-1 Technical Report
cs.RO 2026-05 unverdicted novelty 5.0

HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
Learning Versatile Humanoid Manipulation with Touch Dreaming
cs.RO 2026-04 conditional novelty 5.0

HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots
cs.RO 2026-04 unverdicted novelty 5.0

Tree Learning uses root-branch parameter inheritance and multi-modal adaptation to enable continual multi-skill learning in humanoid robots, achieving higher rewards and 100% retention versus joint training in Unity s...

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 13 Pith papers · 18 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, and et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022. URLhttps://arxiv.org/abs/2204.01691. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Karen Liu

Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 10

work page arXiv 2025
[4]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 2

work page arXiv 2025
[5]

Gr00t n1.5: An improved open foundation model for generalist humanoid robots

Johan Bjorck, Valts Blukis, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi “Jim” Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Xiaowei Jiang, Kaushil Kundalia, Jan Kautz, Zhiqi Li, Kevin Lin, Zongyu Lin, Loic Magne, Yunze Man, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang...

work page 2025
[6]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi “Jim” Fan, Yu Fang, DieterFox, FengyuanHu, SpencerHuang, JoelJang, ZhenyuJiang, JanKautz, KaushilKundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guan...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14734 2025
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Robin Rombach, and Patrick Esser. Stable video diffusion: Scaling latent video diffusion models.arXiv preprint arXiv:2311.15127, 2023. URLhttps://arxiv.org/abs/ 2311.15127. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, and et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021. URLhttps://arxiv.org/abs/2108.07258. 1

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Carlos Carbajal, and et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. URLhttps://arxiv.org/abs/2212.06817. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Rt-2: Vision-language-action models transfer web knowledge to robotic control.Nature, 2023

Anthony Brohan, Noah Brown, Ilya Chelombiev, and et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.Nature, 2023. doi: 10.1038/s41586-023-06475-7. 2

work page doi:10.1038/s41586-023-06475-7 2023
[11]

Video generation models as world simulators.OpenAI, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI, 2024. 1

work page 2024
[12]

Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025. 2, 4, 10, 12 19 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

work page arXiv 2025
[13]

Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022. 14

work page 2022
[14]

Momask: Generative masked modeling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024. 15

work page 1900
[15]

Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024. 2, 12

work page arXiv 2024
[16]

Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. 2

work page arXiv 2025
[17]

Hover: Versatile neural whole-body controller for humanoid robots

Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, et al. Hover: Versatile neural whole-body controller for humanoid robots. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9989–9996. IEEE, 2025. 2

work page 2025
[18]

Learning getting-up policies for real-world humanoid robots.arXiv preprint arXiv:2502.12152, 2025

Xialin He, Runpei Dong, Zixuan Chen, and Saurabh Gupta. Learning getting-up policies for real-world humanoid robots.arXiv preprint arXiv:2502.12152, 2025. 2

work page arXiv 2025
[19]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022. URLhttps://arxiv.org/abs/2203.15556. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Learning humanoid standing-up control across diverse postures.arXiv preprint arXiv:2502.08378, 2025

Tao Huang, Junli Ren, Huayi Wang, Zirui Wang, Qingwei Ben, Muning Wen, Xiao Chen, Jianan Li, and Jiangmiao Pang. Learning humanoid standing-up control across diverse postures.arXiv preprint arXiv:2502.08378, 2025. 2

work page arXiv 2025
[22]

Farrar, Straus and Giroux, 2011

Daniel Kahneman.Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. ISBN 9780374275631. 9

work page 2011
[23]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, and et al. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. URLhttps://arxiv.org/abs/2001.08361. 1

work page internal anchor Pith review Pith/arXiv arXiv 2001
[24]

Pyroki: A modular toolkit for robot kinematic optimization

Chung Min Kim*, Brent Yi*, Hongsuk Choi, Yi Ma, Ken Goldberg, and Angjoo Kanazawa. Pyroki: A modular toolkit for robot kinematic optimization. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. URLhttps://arxiv.org/abs/2505.03728. 10

work page arXiv 2025
[25]

Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. InCVPR, 2020. doi: 10.1109/CVPR42600.2020.01265. 3

work page doi:10.1109/cvpr42600.2020.01265 2020
[26]

Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control.arXiv preprint arXiv:2505.03738,

Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, and Xiaolong Wang. Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control.arXiv preprint arXiv:2505.03738,

work page arXiv
[27]

Genmo: A generalist model for human motion

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A generalist model for human motion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 6, 14, 16, 24 20 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

work page 2025
[28]

Bailando: 3d dance generation via actor-critic gpt with choreographic memory

Ruilong Li, Shan Li, Angjoo Huang, and et al. Bailando: 3d dance generation via actor-critic gpt with choreographic memory. InSIGGRAPH Asia 2021, 2021. doi: 10.1145/3478513.3480495. 3

work page doi:10.1145/3478513.3480495 2021
[29]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI Choreographer: Music conditioned 3d dance generation with aist++. InIEEE/CVF International Conference on Computer Vision (ICCV), October

work page
[30]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Be- yondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025. 2, 4, 10, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: A skinned multi-person linear model.ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6): 248:1–248:16, 2015. doi: 10.1145/2816795.2818013. URLhttp://smpl.is.tue.mpg.de. 8, 13

work page doi:10.1145/2816795.2818013 2015
[32]

Winkler, Kris Kitani, and Weipeng Xu

Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Kitani, and Weipeng Xu. Perpetual humanoid control for real-time simulated avatars. InInternational Conference on Computer Vision (ICCV), 2023. 2, 12

work page 2023
[33]

Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

work page arXiv
[34]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Yi Ma, Zahid Hazara, Brian Ichter, and et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. URLhttps://arxiv.org/abs/2403.12945. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Troje, Gerard Pons-Moll, and Michael J

Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. 2, 4

work page 2019
[36]

Rethinking diffusion for text-driven human motion generation.arXiv preprint arXiv:2411.16575, 2024

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation.arXiv preprint arXiv:2411.16575, 2024. 14

work page arXiv 2024
[37]

Absolute coordinates make motion generation easy.arXiv preprint arXiv:2505.19377, 2025

Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, and Huaizu Jiang. Absolute coordinates make motion generation easy.arXiv preprint arXiv:2505.19377, 2025. 14

work page arXiv 2025
[38]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

NVIDIA, :, Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yiji...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. URLhttps://arxiv.org/abs/2310.08864. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Deepmimic: Example-guided deep reinforcement learning of physics-based character skills

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. InACM SIGGRAPH 2018 Papers, 2018. doi: 10.1145/3197517.3201311. 1

work page doi:10.1145/3197517.3201311 2018
[42]

Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4), 2021

Xue Bin Peng, Zhaoyu Zhou, Stephen Luo, and Michiel van de Panne. Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4), 2021. doi: 10.1145/ 3450626.3459670. 1

work page arXiv 2021
[43]

Mmm: Generative masked motion model

Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1546–1555, 2024. 15

work page 2024
[44]

Black, Tushar Kapadi, and Gerard Pons-Moll

Abhinanda Punnakkal, Michael J. Black, Tushar Kapadi, and Gerard Pons-Moll. Babel: Bodies, action and behavior with english labels. InCVPR, 2021. doi: 10.1109/CVPR46437.2021.00756. 2

work page doi:10.1109/cvpr46437.2021.00756 2021
[45]

InWorkshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, Robotics: Science and Systems (RSS), June 2025

Marc Raibert and Farbod Farshidian. InWorkshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, Robotics: Science and Systems (RSS), June 2025. 12

work page 2025
[46]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1

work page 2022
[48]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy opti- mization algorithms.arXiv preprint arXiv:1707.06347, 2017. 12

work page internal anchor Pith review Pith/arXiv arXiv 2017
[49]

Neural animation layering for synthesizing martial arts movements.ACM Transactions on Graphics (TOG), 40(4):1–16, 2021

Sebastian Starke, Yiwei Zhao, Fabio Zinno, and Taku Komura. Neural animation layering for synthesizing martial arts movements.ACM Transactions on Graphics (TOG), 40(4):1–16, 2021. 5

work page 2021
[50]

Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson. html, 2019. 1

work page 2019
[51]

Human Motion Diffusion Model

Guy Tevet, Sigal Raab, Yuval Shafir, and et al. Human motion diffusion model.arXiv preprint arXiv:2209.14916, 2022. URLhttps://arxiv.org/abs/2209.14916. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[52]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109. 5

work page doi:10.1109/iros.2012.6386109 2012
[53]

Humanoid robot G1_Humanoid Robot Functions_Humanoid Robot Price | Unitree Robotics — unitree.com.https://www.unitree.com/g1, 2024

Unitree. Humanoid robot G1_Humanoid Robot Functions_Humanoid Robot Price | Unitree Robotics — unitree.com.https://www.unitree.com/g1, 2024. [Accessed 31-10-2025]. 3

work page 2024
[54]

Unitree boxing.https://www.unitree.com/boxing, 2025

Unitree. Unitree boxing.https://www.unitree.com/boxing, 2025. Accessed: 2024-06-30. 5

work page 2025
[55]

Trl: Transformer reinforcement learning.https: //github.com/huggingface/trl, 2020

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning.https: //github.com/huggingface/trl, 2020. 14 22 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

work page 2020
[56]

Unicon: Universal neural controller for physics-based character motion.arXiv preprint arXiv:2011.15119, 2020

Tingwu Wang, Yunrong Guo, Maria Shugrina, and Sanja Fidler. Unicon: Universal neural controller for physics-based character motion.arXiv preprint arXiv:2011.15119, 2020. 2

work page arXiv 2011
[57]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, and et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022. URLhttps://arxiv.org/abs/2206.07682. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[58]

Control strategies for physically simulated characters performing two-player competitive sports.ACM Transactions on Graphics (TOG), 40(4):1–11,

Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Control strategies for physically simulated characters performing two-player competitive sports.ACM Transactions on Graphics (TOG), 40(4):1–11,

work page
[59]

Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025. URLhttps://arxiv.org/abs/2507.07356. 2

work page arXiv 2025
[60]

Magvit: Masked generative video transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469,

work page
[61]

Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

Yanjie Ze, Zixuan Chen, JoÃG, o Pedro AraÃšjo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025. 2, 4

work page arXiv 2025
[62]

Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

Weishuai Zeng, Shunlin Lu, Kangning Yin, Xiaojie Niu, Minyue Dai, Jingbo Wang, and Jiangmiao Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025. 2, 4

work page arXiv 2025
[63]

T2m-gpt: Generating human motion from textual descriptions with discrete representations

Ye Zhang, Tong He, Qingxuan Zhang, and et al. T2m-gpt: Generating human motion from textual descriptions with discrete representations. InCVPR, 2023. doi: 10.1109/CVPR52729.2023.00877. 3

work page doi:10.1109/cvpr52729.2023.00877 2023
[64]

Trackanymotionsunderanydisturbances.arXivpreprintarXiv:2509.13833,

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, MaoqiLiu, HuapingLiu, etal. Trackanymotionsunderanydisturbances.arXivpreprintarXiv:2509.13833,

work page arXiv
[65]

Xrobotoolkit: A cross-platform framework for robot teleoperation.arXiv preprint arXiv:2508.00097, 2025

Zhigen Zhao, Liuchuan Yu, Ke Jing, and Ning Yang. Xrobotoolkit: A cross-platform framework for robot teleoperation.arXiv preprint arXiv:2508.00097, 2025. 8, 14

work page arXiv 2025
[66]

walk forward

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation represen- tations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019. 12 23 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control A. Supplementary Materials S1. Supplem...

work page 2019

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, and et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022. URLhttps://arxiv.org/abs/2204.01691. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Karen Liu

Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 10

work page arXiv 2025

[4] [4]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 2

work page arXiv 2025

[5] [5]

Gr00t n1.5: An improved open foundation model for generalist humanoid robots

Johan Bjorck, Valts Blukis, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi “Jim” Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Xiaowei Jiang, Kaushil Kundalia, Jan Kautz, Zhiqi Li, Kevin Lin, Zongyu Lin, Loic Magne, Yunze Man, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang...

work page 2025

[6] [6]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi “Jim” Fan, Yu Fang, DieterFox, FengyuanHu, SpencerHuang, JoelJang, ZhenyuJiang, JanKautz, KaushilKundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guan...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14734 2025

[7] [7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Robin Rombach, and Patrick Esser. Stable video diffusion: Scaling latent video diffusion models.arXiv preprint arXiv:2311.15127, 2023. URLhttps://arxiv.org/abs/ 2311.15127. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, and et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021. URLhttps://arxiv.org/abs/2108.07258. 1

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Carlos Carbajal, and et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. URLhttps://arxiv.org/abs/2212.06817. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [10]

Rt-2: Vision-language-action models transfer web knowledge to robotic control.Nature, 2023

Anthony Brohan, Noah Brown, Ilya Chelombiev, and et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.Nature, 2023. doi: 10.1038/s41586-023-06475-7. 2

work page doi:10.1038/s41586-023-06475-7 2023

[11] [11]

Video generation models as world simulators.OpenAI, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI, 2024. 1

work page 2024

[12] [12]

Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025. 2, 4, 10, 12 19 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

work page arXiv 2025

[13] [13]

Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022. 14

work page 2022

[14] [14]

Momask: Generative masked modeling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024. 15

work page 1900

[15] [15]

Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024. 2, 12

work page arXiv 2024

[16] [16]

Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. 2

work page arXiv 2025

[17] [17]

Hover: Versatile neural whole-body controller for humanoid robots

Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, et al. Hover: Versatile neural whole-body controller for humanoid robots. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9989–9996. IEEE, 2025. 2

work page 2025

[18] [18]

Learning getting-up policies for real-world humanoid robots.arXiv preprint arXiv:2502.12152, 2025

Xialin He, Runpei Dong, Zixuan Chen, and Saurabh Gupta. Learning getting-up policies for real-world humanoid robots.arXiv preprint arXiv:2502.12152, 2025. 2

work page arXiv 2025

[19] [19]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[20] [20]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022. URLhttps://arxiv.org/abs/2203.15556. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

Learning humanoid standing-up control across diverse postures.arXiv preprint arXiv:2502.08378, 2025

Tao Huang, Junli Ren, Huayi Wang, Zirui Wang, Qingwei Ben, Muning Wen, Xiao Chen, Jianan Li, and Jiangmiao Pang. Learning humanoid standing-up control across diverse postures.arXiv preprint arXiv:2502.08378, 2025. 2

work page arXiv 2025

[22] [22]

Farrar, Straus and Giroux, 2011

Daniel Kahneman.Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. ISBN 9780374275631. 9

work page 2011

[23] [23]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, and et al. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. URLhttps://arxiv.org/abs/2001.08361. 1

work page internal anchor Pith review Pith/arXiv arXiv 2001

[24] [24]

Pyroki: A modular toolkit for robot kinematic optimization

Chung Min Kim*, Brent Yi*, Hongsuk Choi, Yi Ma, Ken Goldberg, and Angjoo Kanazawa. Pyroki: A modular toolkit for robot kinematic optimization. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. URLhttps://arxiv.org/abs/2505.03728. 10

work page arXiv 2025

[25] [25]

Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. InCVPR, 2020. doi: 10.1109/CVPR42600.2020.01265. 3

work page doi:10.1109/cvpr42600.2020.01265 2020

[26] [26]

Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control.arXiv preprint arXiv:2505.03738,

Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, and Xiaolong Wang. Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control.arXiv preprint arXiv:2505.03738,

work page arXiv

[27] [27]

Genmo: A generalist model for human motion

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A generalist model for human motion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 6, 14, 16, 24 20 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

work page 2025

[28] [28]

Bailando: 3d dance generation via actor-critic gpt with choreographic memory

Ruilong Li, Shan Li, Angjoo Huang, and et al. Bailando: 3d dance generation via actor-critic gpt with choreographic memory. InSIGGRAPH Asia 2021, 2021. doi: 10.1145/3478513.3480495. 3

work page doi:10.1145/3478513.3480495 2021

[29] [29]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI Choreographer: Music conditioned 3d dance generation with aist++. InIEEE/CVF International Conference on Computer Vision (ICCV), October

work page

[30] [30]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Be- yondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025. 2, 4, 10, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: A skinned multi-person linear model.ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6): 248:1–248:16, 2015. doi: 10.1145/2816795.2818013. URLhttp://smpl.is.tue.mpg.de. 8, 13

work page doi:10.1145/2816795.2818013 2015

[32] [32]

Winkler, Kris Kitani, and Weipeng Xu

Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Kitani, and Weipeng Xu. Perpetual humanoid control for real-time simulated avatars. InInternational Conference on Computer Vision (ICCV), 2023. 2, 12

work page 2023

[33] [33]

Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

work page arXiv

[34] [34]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Yi Ma, Zahid Hazara, Brian Ichter, and et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. URLhttps://arxiv.org/abs/2403.12945. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Troje, Gerard Pons-Moll, and Michael J

Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. 2, 4

work page 2019

[36] [36]

Rethinking diffusion for text-driven human motion generation.arXiv preprint arXiv:2411.16575, 2024

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation.arXiv preprint arXiv:2411.16575, 2024. 14

work page arXiv 2024

[37] [37]

Absolute coordinates make motion generation easy.arXiv preprint arXiv:2505.19377, 2025

Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, and Huaizu Jiang. Absolute coordinates make motion generation easy.arXiv preprint arXiv:2505.19377, 2025. 14

work page arXiv 2025

[38] [38]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

NVIDIA, :, Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yiji...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. URLhttps://arxiv.org/abs/2310.08864. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Deepmimic: Example-guided deep reinforcement learning of physics-based character skills

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. InACM SIGGRAPH 2018 Papers, 2018. doi: 10.1145/3197517.3201311. 1

work page doi:10.1145/3197517.3201311 2018

[42] [42]

Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4), 2021

Xue Bin Peng, Zhaoyu Zhou, Stephen Luo, and Michiel van de Panne. Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4), 2021. doi: 10.1145/ 3450626.3459670. 1

work page arXiv 2021

[43] [43]

Mmm: Generative masked motion model

Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1546–1555, 2024. 15

work page 2024

[44] [44]

Black, Tushar Kapadi, and Gerard Pons-Moll

Abhinanda Punnakkal, Michael J. Black, Tushar Kapadi, and Gerard Pons-Moll. Babel: Bodies, action and behavior with english labels. InCVPR, 2021. doi: 10.1109/CVPR46437.2021.00756. 2

work page doi:10.1109/cvpr46437.2021.00756 2021

[45] [45]

InWorkshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, Robotics: Science and Systems (RSS), June 2025

Marc Raibert and Farbod Farshidian. InWorkshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, Robotics: Science and Systems (RSS), June 2025. 12

work page 2025

[46] [46]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[47] [47]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1

work page 2022

[48] [48]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy opti- mization algorithms.arXiv preprint arXiv:1707.06347, 2017. 12

work page internal anchor Pith review Pith/arXiv arXiv 2017

[49] [49]

Neural animation layering for synthesizing martial arts movements.ACM Transactions on Graphics (TOG), 40(4):1–16, 2021

Sebastian Starke, Yiwei Zhao, Fabio Zinno, and Taku Komura. Neural animation layering for synthesizing martial arts movements.ACM Transactions on Graphics (TOG), 40(4):1–16, 2021. 5

work page 2021

[50] [50]

Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson. html, 2019. 1

work page 2019

[51] [51]

Human Motion Diffusion Model

Guy Tevet, Sigal Raab, Yuval Shafir, and et al. Human motion diffusion model.arXiv preprint arXiv:2209.14916, 2022. URLhttps://arxiv.org/abs/2209.14916. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[52] [52]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109. 5

work page doi:10.1109/iros.2012.6386109 2012

[53] [53]

Humanoid robot G1_Humanoid Robot Functions_Humanoid Robot Price | Unitree Robotics — unitree.com.https://www.unitree.com/g1, 2024

Unitree. Humanoid robot G1_Humanoid Robot Functions_Humanoid Robot Price | Unitree Robotics — unitree.com.https://www.unitree.com/g1, 2024. [Accessed 31-10-2025]. 3

work page 2024

[54] [54]

Unitree boxing.https://www.unitree.com/boxing, 2025

Unitree. Unitree boxing.https://www.unitree.com/boxing, 2025. Accessed: 2024-06-30. 5

work page 2025

[55] [55]

Trl: Transformer reinforcement learning.https: //github.com/huggingface/trl, 2020

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning.https: //github.com/huggingface/trl, 2020. 14 22 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

work page 2020

[56] [56]

Unicon: Universal neural controller for physics-based character motion.arXiv preprint arXiv:2011.15119, 2020

Tingwu Wang, Yunrong Guo, Maria Shugrina, and Sanja Fidler. Unicon: Universal neural controller for physics-based character motion.arXiv preprint arXiv:2011.15119, 2020. 2

work page arXiv 2011

[57] [57]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, and et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022. URLhttps://arxiv.org/abs/2206.07682. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[58] [58]

Control strategies for physically simulated characters performing two-player competitive sports.ACM Transactions on Graphics (TOG), 40(4):1–11,

Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Control strategies for physically simulated characters performing two-player competitive sports.ACM Transactions on Graphics (TOG), 40(4):1–11,

work page

[59] [59]

Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025. URLhttps://arxiv.org/abs/2507.07356. 2

work page arXiv 2025

[60] [60]

Magvit: Masked generative video transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469,

work page

[61] [61]

Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

Yanjie Ze, Zixuan Chen, JoÃG, o Pedro AraÃšjo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025. 2, 4

work page arXiv 2025

[62] [62]

Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

Weishuai Zeng, Shunlin Lu, Kangning Yin, Xiaojie Niu, Minyue Dai, Jingbo Wang, and Jiangmiao Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025. 2, 4

work page arXiv 2025

[63] [63]

T2m-gpt: Generating human motion from textual descriptions with discrete representations

Ye Zhang, Tong He, Qingxuan Zhang, and et al. T2m-gpt: Generating human motion from textual descriptions with discrete representations. InCVPR, 2023. doi: 10.1109/CVPR52729.2023.00877. 3

work page doi:10.1109/cvpr52729.2023.00877 2023

[64] [64]

Trackanymotionsunderanydisturbances.arXivpreprintarXiv:2509.13833,

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, MaoqiLiu, HuapingLiu, etal. Trackanymotionsunderanydisturbances.arXivpreprintarXiv:2509.13833,

work page arXiv

[65] [65]

Xrobotoolkit: A cross-platform framework for robot teleoperation.arXiv preprint arXiv:2508.00097, 2025

Zhigen Zhao, Liuchuan Yu, Ke Jing, and Ning Yang. Xrobotoolkit: A cross-platform framework for robot teleoperation.arXiv preprint arXiv:2508.00097, 2025. 8, 14

work page arXiv 2025

[66] [66]

walk forward

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation represen- tations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019. 12 23 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control A. Supplementary Materials S1. Supplem...

work page 2019