Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Feiyu Jia; Jiahe Chen; Jiangmiao Pang; Jingbo Wang; Tianfan Xue; Weishuai Zeng; Xiao Chen; Xiaojie Niu; Xiaowei Zhou; Zirui Wang

arxiv: 2605.22272 · v2 · pith:VKV62HWPnew · submitted 2026-05-21 · 💻 cs.RO · cs.CV

Pith reviewed 2026-05-25 05:57 UTC · model grok-4.3

keywords humanoid-object interaction · zero-shot deployment · video generative priors · 4D point trajectories · behavior foundation model · keypoints tracking · whole-body control · robotics

The pith

A keypoints tracker using only base, hands, and object points inside a behavior foundation model latent space enables zero-shot humanoid-object interactions from video priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that whole-body humanoid-object interactions can reach zero-shot physical deployment by representing motions as unified 4D point trajectories and limiting tracking to sparse base, hands, and object points inside a pre-trained behavior foundation model. This matters to a sympathetic reader because existing video-to-robot methods depend on scarce 3D data, explicit CAD models, and error-prone full-body retargeting that blocks practical use. By staying inside the model's latent space and using progressive training with basic tracking rewards, the method claims to produce natural gaits and robust behaviors without those extra steps.

Core claim

Imagine2Real resolves representation misalignment by formulating robot and object motions as unified 4D point trajectories. It overcomes retargeting complexity with a Keypoints Tracker that tracks only sparse critical points (base, hands, and object) and uses the latent space of a Behavior Foundation Model as the search domain. Progressive training with simple tracking rewards then produces robust behaviors that support zero-shot physical deployment within a motion capture system.
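
To make the representation concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a 4D point trajectory can be stored as one 3D position per named keypoint per time step, and that a "simple tracking reward" takes the commonly used exponential-of-squared-error form. The keypoint names, the `scale` parameter, and the function `tracking_reward` are hypothetical and do not come from the paper.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
# Assumed layout of a "4D point trajectory": for each named keypoint,
# one 3D position per time step (3D space + time).
Trajectory = Dict[str, List[Point]]

# Hypothetical names for the sparse critical points (base, hands, object).
SPARSE_KEYPOINTS = ("base", "left_hand", "right_hand", "object")


def tracking_reward(reference: Trajectory, actual: Trajectory,
                    t: int, scale: float = 5.0) -> float:
    """Return a reward in (0, 1]; 1.0 means every sparse keypoint
    matches the reference exactly at time step t."""
    sq_err = 0.0
    for name in SPARSE_KEYPOINTS:
        ref_pt = reference[name][t]
        act_pt = actual[name][t]
        sq_err += sum((r - a) ** 2 for r, a in zip(ref_pt, act_pt))
    mean_sq_err = sq_err / len(SPARSE_KEYPOINTS)
    # Assumed reward shape: exponential of the negative mean squared error.
    return math.exp(-scale * mean_sq_err)
```

Because robot and object points share one container and one error term, no CAD model or joint-level retargeting appears anywhere in this sketch; whether that carries over to the real system is exactly what the paper has to demonstrate.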

What carries the argument

The Keypoints Tracker, which searches motions inside the latent space of a pre-trained Behavior Foundation Model using only sparse base, hands, and object points derived from unified 4D point trajectories.
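
The idea of using a latent space as the search domain can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the paper's method: `toy_decoder` is a hypothetical fixed function playing the role of a frozen BFM decoder, and a simple random hill climb replaces whatever learned tracker the authors actually train. It only shows the structural point that the search moves a latent code, never the keypoints or joints directly.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_decoder(z: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a frozen behavior-model decoder:
    maps a 2D latent code to three flattened keypoint coordinates."""
    return [z[0] + z[1], z[0] - z[1], 0.5 * z[0]]


def keypoint_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between predicted and target keypoints."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def search_latent(decoder: Callable[[Sequence[float]], Vector],
                  target: Sequence[float], dim: int = 2,
                  iters: int = 2000, step: float = 0.1,
                  seed: int = 0) -> Tuple[Vector, float]:
    """Random hill climb over the latent code; the decoder stays fixed."""
    rng = random.Random(seed)
    best = [0.0] * dim
    best_err = keypoint_error(decoder(best), target)
    for _ in range(iters):
        # Perturb the current best latent and keep it only if it tracks better.
        cand = [b + rng.gauss(0.0, step) for b in best]
        err = keypoint_error(decoder(cand), target)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```

Every candidate is an output of the decoder, so the search can only return motions the decoder is able to produce. That is the mechanism by which a behavior prior could keep gaits natural under sparse supervision, assuming the real latent space is as well behaved as the claim requires.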

If this is right

Resolves representation misalignment between video and robot without reliance on geometric priors such as explicit CAD models.
Bypasses the error-amplifying full-body retargeting process by tracking only sparse critical points.
Maintains natural gaits and robust object interactions from sparse signals by restricting the search to the BFM latent space.
Enables zero-shot physical deployment of whole-body HOI after progressive training with simple tracking rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If sparse base-hand-object tracking proves adequate, full-body retargeting pipelines may become unnecessary for many interaction tasks.
The method could reduce dependence on large dedicated 3D motion datasets by leveraging existing video generative priors.
A testable extension would be to apply the same tracker to new robot morphologies without retraining the underlying behavior model.

Load-bearing premise

Sparse tracking of only base, hands, and object points inside the BFM latent space is sufficient to produce natural gaits and robust interaction behaviors without geometric priors or full-body retargeting.

What would settle it

A physical mocap deployment trial in which the humanoid exhibits unstable gaits, dropped objects, or unnatural motion when performing previously unseen interactions after training with the sparse keypoints tracker.
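
One way such a trial could be scored is sketched below. This is an illustrative assumption, not a protocol from the paper: the `Trial` fields, the 0.10 m error threshold, and the `success_rate` function are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """Outcome of one hypothetical deployment trial."""
    fell: bool                    # the humanoid lost balance
    dropped_object: bool          # the object was dropped during interaction
    mean_keypoint_error_m: float  # mean sparse-keypoint tracking error, metres


def success_rate(trials: List[Trial], max_error_m: float = 0.10) -> float:
    """Fraction of trials with no fall, no drop, and error within threshold."""
    if not trials:
        raise ValueError("at least one trial is required")
    ok = sum(
        1 for t in trials
        if not t.fell
        and not t.dropped_object
        and t.mean_keypoint_error_m <= max_error_m
    )
    return ok / len(trials)
```

Reporting this rate separately for seen and previously unseen interactions would show directly whether the zero-shot claim holds.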

Figures

Figures reproduced from arXiv: 2605.22272 by Feiyu Jia, Jiahe Chen, Jiangmiao Pang, Jingbo Wang, Tianfan Xue, Weishuai Zeng, Xiao Chen, Xiaojie Niu, Xiaowei Zhou, Zirui Wang.

**Figure 2.** Overview of the Imagine2Real framework. Top: The zero-shot real-world deployment pipeline synthesizes an interaction video, extracts unified 3D point trajectories via a points tracker, and executes the motion using the Keypoints Tracker and Interaction Adaptor. Bottom: The policy training adopts a three-stage progressive strategy: (1) training a BFM backbone (Encoder, Predictor, Decoder) on diverse whole-b…

**Figure 3.** Qualitative results in simulation. Time-lapse sequences illustrate natural whole-body…

**Figure 4.** Qualitative results of zero-shot real-world deployment. Time-lapse sequences demonstrate…

Original abstract

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper offers a geometry-free pipeline for zero-shot HOI using 4D trajectories and BFM-based sparse tracking, but lacks supporting experiments to back the key assumptions.

The punchline is that Imagine2Real tries to make zero-shot humanoid-object interaction feasible by ditching geometric models and full retargeting in favor of 4D point trajectories and sparse keypoint tracking inside a Behavior Foundation Model's latent space, with progressive training on simple rewards.

What the paper does well is clearly identifying the bottlenecks in current video-prior methods for robotics: the need for CAD models causing misalignment, and the complexity of morphing motions across morphologies. Their unified 4D representation and the decision to track only critical points like base, hands, and object are a direct response to those issues. Using the BFM latent space as the search domain to preserve natural gaits is an interesting choice that could work if the model has captured the right motion statistics.

The soft spots are around validation. The abstract describes the pipeline but gives no numbers on success rates, no ablations on the number of keypoints or the role of the BFM, and no details on the mocap deployment results. Without that, it's difficult to know if the sparse signals really suffice or if the policies end up with weird gaits or poor interaction robustness, as the stress-test note suggests. The central claim about zero-shot physical deployment would be more convincing with some real hardware evidence or comparisons.

This kind of paper is for people in the robotics community focused on scaling humanoid skills with generative models. A reader looking for new ways to connect video generation to physical control might get ideas from it. It deserves a serious referee because the problem it tackles is important and the proposed integration is specific enough to be worth detailed feedback, even though the current version is mostly high-level. I'd recommend sending it to peer review with the expectation that reviewers will ask for more experimental support.

Referee Report

2 major / 0 minor

Summary. The paper proposes Imagine2Real, a zero-shot framework for whole-body humanoid-object interaction (HOI) that leverages video generative priors to address data scarcity. It resolves representation misalignment via unified 4D point trajectories for robot and object motions, and overcomes retargeting complexity by using a Keypoints Tracker that follows only sparse critical points (base, hands, object) inside the latent space of a pre-trained Behavior Foundation Model (BFM). A progressive training strategy with simple tracking rewards is claimed to yield natural gaits and robust behaviors suitable for zero-shot physical deployment in a mocap system.

Significance. If the central claims hold, the work could meaningfully advance geometry-free HOI by bypassing CAD models and full-body retargeting, potentially enabling more flexible deployment from generative priors. The use of BFM latent space as a search domain for sparse tracking is a distinctive idea that might compensate for missing kinematic signals. However, the manuscript contains no quantitative results, ablations, or deployment data, so the practical significance remains unevaluable at present.

major comments (2)

[Abstract] Abstract: No experiments, quantitative results, ablation studies, or implementation details are provided to substantiate whether unified 4D trajectories resolve misalignment, whether the sparse Keypoints Tracker in BFM latent space produces natural gaits without geometric priors, or whether progressive training enables the claimed zero-shot hardware deployment.
[Abstract] Abstract, paragraph on Keypoints Tracker and BFM search domain: The assumption that tracking only base/hands/object points inside the BFM latent space suffices for natural whole-body motions and robust HOI (without full-body retargeting or geometric priors) is load-bearing for the zero-shot claim, yet no supporting analysis, ablation on keypoint sparsity, or comparison to direct tracking is given.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. Below we respond point-by-point to the major comments on the abstract.

Referee: [Abstract] Abstract: No experiments, quantitative results, ablation studies, or implementation details are provided to substantiate whether unified 4D trajectories resolve misalignment, whether the sparse Keypoints Tracker in BFM latent space produces natural gaits without geometric priors, or whether progressive training enables the claimed zero-shot hardware deployment.

Authors: The referee is correct that the submitted manuscript presents the Imagine2Real framework and its design rationale without accompanying quantitative experiments, ablations, or deployment metrics. The zero-shot claim is currently supported only by the architectural arguments (unified 4D trajectories, sparse tracking inside the BFM latent space, and progressive reward shaping) laid out in the methods. We will add a new experimental section containing simulation results, mocap deployment metrics, and basic implementation details in the revised manuscript.

revision: yes

Referee: [Abstract] Abstract, paragraph on Keypoints Tracker and BFM search domain: The assumption that tracking only base/hands/object points inside the BFM latent space suffices for natural whole-body motions and robust HOI (without full-body retargeting or geometric priors) is load-bearing for the zero-shot claim, yet no supporting analysis, ablation on keypoint sparsity, or comparison to direct tracking is given.

Authors: We acknowledge that the manuscript does not yet contain an ablation on keypoint sparsity or a direct comparison against full-body or direct-tracking baselines. The choice of sparse base/hands/object keypoints is justified in the text by the need to bypass retargeting error amplification while relying on the BFM prior to complete natural whole-body motion; however, empirical verification of this assumption is absent. We will include the requested ablation study and baseline comparison in the revision.

revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; no circularity detectable

The provided abstract and manuscript excerpt contain only high-level descriptive claims about the method (unified 4D trajectories, sparse keypoints tracker inside BFM latent space, progressive training with tracking rewards). No equations, derivations, fitted parameters presented as predictions, self-citations invoked as load-bearing uniqueness theorems, or ansatzes are visible. Without any explicit derivation chain to inspect, no reduction to inputs by construction can be exhibited. This matches the default expectation for papers lacking technical derivation content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the Behavior Foundation Model is treated as an external prior whose properties are not audited here.

pith-pipeline@v0.9.0 · 5754 in / 1385 out tokens · 47796 ms · 2026-05-25T05:57:07.575570+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 10 internal anchors

[1]

Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

work page arXiv 2025
[2]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castaneda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

work page 2023
[6]

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al. Pi0.7: a steerable gen- eralist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

work page 2021
[8]

Physhsi: Towards a real-world generalizable and natural humanoid-scene interaction system.arXiv preprint arXiv:2510.11072, 2025

Huayi Wang, Wentao Zhang, Runyi Yu, Tao Huang, Junli Ren, Feiyu Jia, Zirui Wang, Xiaojie Niu, Xiao Chen, Jiahe Chen, et al. Physhsi: Towards a real-world generalizable and natural humanoid-scene interaction system.arXiv preprint arXiv:2510.11072, 2025

work page arXiv 2025
[9]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

work page 2016
[10]

Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

Siheng Zhao, Yanjie Ze, Yue Wang, C Karen Liu, Pieter Abbeel, Guanya Shi, and Rocky Duan. Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

work page arXiv 2025
[11]

Hdmi: Learning interactive humanoid whole-body control from human videos.arXiv preprint arXiv:2509.16757, 2025

Haoyang Weng, Yitang Li, Nikhil Sobanbabu, Zihan Wang, Zhengyi Luo, Tairan He, Deva Ramanan, and Guanya Shi. Hdmi: Learning interactive humanoid whole-body control from human videos.arXiv preprint arXiv:2509.16757, 2025

work page arXiv 2025
[12]

From generated human videos to physically plausible robot trajectories.arXiv preprint arXiv:2512.05094, 2025

James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, and Roei Herzig. From generated human videos to physically plausible robot trajectories.arXiv preprint arXiv:2512.05094, 2025

work page arXiv 2025
[13]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

work page 2024
[14]

Foundationpose: Unified 6d pose estimation and tracking of novel objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024

work page 2024
[15]

Humanx: Toward agile and generalizable humanoid interaction skills from human videos.arXiv preprint arXiv:2602.02473, 2026

Yinhuai Wang, Qihan Zhao, Yuen Fui Lau, Runyi Yu, Hok Wai Tsui, Qifeng Chen, Jingbo Wang, Jiangmiao Pang, and Ping Tan. Humanx: Toward agile and generalizable humanoid interaction skills from human videos.arXiv preprint arXiv:2602.02473, 2026

work page arXiv 2026
[16]

Omniretarget: Interaction-preserving data gen- eration for humanoid whole-body loco-manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C Karen Liu, Rocky Duan, and Guanya Shi. Omniretarget: Interaction-preserving data gen- eration for humanoid whole-body loco-manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025. 9

work page arXiv 2025
[17]

Spatial relationship preserving character motion adaptation

Edmond SL Ho, Taku Komura, and Chiew-Lan Tai. Spatial relationship preserving character motion adaptation. InACM SIGGRAPH 2010 papers, pages 1–8. 2010

work page 2010
[18]

Visualmimic: Visual hu- manoid loco-manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C Karen Liu, and Jiajun Wu. Visualmimic: Visual hu- manoid loco-manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

work page arXiv 2025
[19]

Hand-eye au- tonomous delivery: Learning humanoid navigation, locomotion and reaching.arXiv preprint arXiv:2508.03068, 2025

Sirui Chen, Yufei Ye, Zi-ang Cao, Jennifer Lew, Pei Xu, and C Karen Liu. Hand-eye au- tonomous delivery: Learning humanoid navigation, locomotion and reaching.arXiv preprint arXiv:2508.03068, 2025

work page arXiv 2025
[20]

Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning

Yitang Li, Zhengyi Luo, Tonghe Zhang, Cunxi Dai, Anssi Kanervisto, Andrea Tirinzoni, Haoyang Weng, Kris Kitani, Mateusz Guzek, Ahmed Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. arXiv preprint arXiv:2511.04131, 2025

work page arXiv 2025
[21]

Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

Weishuai Zeng, Shunlin Lu, Kangning Yin, Xiaojie Niu, Minyue Dai, Jingbo Wang, and Jiang- miao Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

work page arXiv 2025
[22]

Intermimic: Towards univer- sal whole-body control for physics-based human-object interactions

Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, and Liang-Yan Gui. Intermimic: Towards univer- sal whole-body control for physics-based human-object interactions. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12266–12277, 2025

work page 2025
[23]

Deepmimic: Example- guided deep reinforcement learning of physics-based character skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example- guided deep reinforcement learning of physics-based character skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

work page 2018
[24]

Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

work page arXiv 2025
[25]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023

work page 2023
[26]

Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

work page arXiv 2024
[27]

Hover: Versatile neural whole-body controller for humanoid robots

Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, et al. Hover: Versatile neural whole-body controller for humanoid robots. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9989–9996. IEEE, 2025

work page 2025
[28]

Exbody2: Advanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiao- long Wang. Exbody2: Advanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

work page arXiv 2024
[29]

Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

work page arXiv 2025
[30]

Clot: Closed-loop global motion tracking for whole-body humanoid teleoperation.arXiv preprint arXiv:2602.15060, 2026

Tengjie Zhu, Guanyu Cai, Yang Zhaohui, Guanzhu Ren, Haohui Xie, ZiRui Wang, Junsong Wu, Jingbo Wang, Xiaokang Yang, Yao Mu, et al. Clot: Closed-loop global motion tracking for whole-body humanoid teleoperation.arXiv preprint arXiv:2602.15060, 2026

work page arXiv 2026
[31]

Agility meets stability: Versatile humanoid control with heterogeneous data.arXiv preprint arXiv:2511.17373, 2025

Yixuan Pan, Ruoyi Qiao, Li Chen, Kashyap Chitta, Liang Pan, Haoguang Mai, Qingwen Bu, Hao Zhao, Cunyuan Zheng, Ping Luo, et al. Agility meets stability: Versatile humanoid control with heterogeneous data.arXiv preprint arXiv:2511.17373, 2025

work page arXiv 2025
[32]

Omnixtreme: Breaking the generality barrier in high-dynamic humanoid control.arXiv preprint arXiv:2602.23843, 2026

Yunshen Wang, Shaohang Zhu, Peiyuan Zhi, Yuhan Li, Jiaxin Li, Yong-Lu Li, Yuchen Xiao, Xingxing Wang, Baoxiong Jia, and Siyuan Huang. Omnixtreme: Breaking the generality barrier in high-dynamic humanoid control.arXiv preprint arXiv:2602.23843, 2026

work page arXiv 2026
[33]

Unitracker: Learning universal whole-body motion tracker for humanoid robots.IEEE Robotics and Automation Letters, 2026

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots.IEEE Robotics and Automation Letters, 2026. 10

work page 2026
[34]

Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026

Zhikai Zhang, Haofei Lu, Yunrui Lian, Ziqing Chen, Yun Liu, Chenghuai Lin, Han Xue, Zicheng Zeng, Zekun Qi, Shaolin Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026

work page arXiv 2026
[35]

Smash: Mastering scalable whole-body skills for humanoid ping-pong with egocentric vision.arXiv preprint arXiv:2604.01158, 2026

Junli Ren, Yinghui Li, Kai Zhang, Penglin Fu, Haoran Jiang, Yixuan Pan, Guangjun Zeng, Tao Huang, Weizhong Guo, Peng Lu, et al. Smash: Mastering scalable whole-body skills for humanoid ping-pong with egocentric vision.arXiv preprint arXiv:2604.01158, 2026

work page arXiv 2026
[36]

Humanoid goalkeeper: Learning from position conditioned task-motion constraints.arXiv preprint arXiv:2510.18002, 2025

Junli Ren, Junfeng Long, Tao Huang, Huayi Wang, Zirui Wang, Feiyu Jia, Wentao Zhang, Jingbo Wang, Ping Luo, and Jiangmiao Pang. Humanoid goalkeeper: Learning from position conditioned task-motion constraints.arXiv preprint arXiv:2510.18002, 2025

work page arXiv 2025
[37]

Hitter: A humanoid table tennis robot via hierarchical planning and learning,

Zhi Su, Bike Zhang, Nima Rahmanian, Yuman Gao, Qiayuan Liao, Caitlin Regan, Koushil Sreenath, and S Shankar Sastry. Hitter: A humanoid table tennis robot via hierarchical planning and learning.arXiv preprint arXiv:2508.21043, 2025

work page arXiv 2025
[38]

Sim-to-real learning for humanoid box loco- manipulation

Jeremy Dao, Helei Duan, and Alan Fern. Sim-to-real learning for humanoid box loco- manipulation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16930–16936. IEEE, 2024

work page 2024
[39]

Opt2skill: Imitating dynamically-feasible whole-body trajectories for versatile humanoid loco-manipulation.IEEE Robotics and Automation Letters, 2025

Fukang Liu, Zhaoyuan Gu, Yilin Cai, Ziyi Zhou, Hyunyoung Jung, Jaehwi Jang, Shijie Zhao, Sehoon Ha, Yue Chen, Danfei Xu, et al. Opt2skill: Imitating dynamically-feasible whole-body trajectories for versatile humanoid loco-manipulation.IEEE Robotics and Automation Letters, 2025

work page 2025
[40]

Wococo: Learning whole-body humanoid control with sequential contacts.arXiv preprint arXiv:2406.06005, 2024

Chong Zhang, Wenli Xiao, Tairan He, and Guanya Shi. Wococo: Learning whole-body humanoid control with sequential contacts.arXiv preprint arXiv:2406.06005, 2024

work page arXiv 2024
[41]

Generalizable humanoid manipulation with 3d diffusion policies

Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, and Jiajun Wu. Generalizable humanoid manipulation with 3d diffusion policies. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2873–2880. IEEE, 2025

work page 2025
[42]

Humanplus: Hu- manoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Hu- manoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

work page arXiv 2024
[43]

Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025

Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025

work page arXiv 2025
[44]

Falcon: Learning force-adaptive humanoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

Yuanhang Zhang, Yifu Yuan, Prajwal Gurunath, Ishita Gupta, Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Marcell Vazquez-Chanlatte, Liam Pedersen, Tairan He, and Guanya Shi. Falcon: Learning force-adaptive humanoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

work page arXiv 2025
[45]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

work page arXiv 2025
[46]

Ultra: Unified multimodal control for autonomous humanoid whole-body loco-manipulation

Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, and Liang-Yan Gui. Ultra: Unified multimodal control for autonomous humanoid whole-body loco-manipulation. arXiv preprint arXiv:2603.03279, 2026

work page arXiv 2026
[47]

De- mohlm: From one demonstration to generalizable humanoid loco-manipulation.arXiv preprint arXiv:2510.11258, 2025

Yuhui Fu, Feiyang Xie, Chaoyi Xu, Jing Xiong, Haoqi Yuan, and Zongqing Lu. De- mohlm: From one demonstration to generalizable humanoid loco-manipulation.arXiv preprint arXiv:2510.11258, 2025

work page arXiv 2025
[48]

Interreal: A unified physics-based imitation framework for learning human-object interaction skills.arXiv preprint arXiv:2603.07516, 2026

Dayang Liang, Yuhang Lin, Xinzhe Liu, Jiyuan Shi, Yunlong Liu, and Chenjia Bai. Interreal: A unified physics-based imitation framework for learning human-object interaction skills.arXiv preprint arXiv:2603.07516, 2026

work page arXiv 2026
[49]

Pro-hoi: Perceptive root-guided humanoid-object interaction.arXiv preprint arXiv:2603.01126, 2026

Yuhang Lin, Jiyuan Shi, Dewei Wang, Jipeng Kong, Yong Liu, Chenjia Bai, and Xuelong Li. Pro-hoi: Perceptive root-guided humanoid-object interaction.arXiv preprint arXiv:2603.01126, 2026

work page arXiv 2026
[50]

Dreamcontrol: Human-inspired whole-body humanoid control for scene interaction via guided diffusion.arXiv preprint arXiv:2509.14353, 2025

Dvij Kalaria, Sudarshan S Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, and Jonathan Chung-Kuan Huang. Dreamcontrol: Human-inspired whole-body humanoid control for scene interaction via guided diffusion.arXiv preprint arXiv:2509.14353, 2025. 11

[51]

Visual imitation enables contextual humanoid control

Arthur Allshire, Hongsuk Choi, Junyi Zhang, David McAllister, Anthony Zhang, Chung Min Kim, Trevor Darrell, Pieter Abbeel, Jitendra Malik, and Angjoo Kanazawa. Visual imitation enables contextual humanoid control. arXiv preprint arXiv:2505.03729, 2025

[52]

Zerowbc: Learning natural visuomotor humanoid control directly from human egocentric video

Haoran Yang, Jiacheng Bao, Yucheng Xin, Haoming Song, Yuyang Tian, Bin Zhao, Dong Wang, and Xuelong Li. Zerowbc: Learning natural visuomotor humanoid control directly from human egocentric video. arXiv preprint arXiv:2603.09170, 2026

[53]

Video generation models as world simulators

OpenAI. Video generation models as world simulators. Technical report, OpenAI, 2024

[54]

Gen-3 alpha

Runway. Gen-3 alpha. Technical report, Runway, 2024

[55]

Kling 1.5

Kuaishou. Kling 1.5. Technical report, Kuaishou Technology, 2024

[56]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

[57]

Seedance 2.0 fast

ByteDance. Seedance 2.0 fast. Technical report, ByteDance, 2025

[58]

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, and Yunzhu Li. Robotic manipulation by imitating generated videos without physical demonstrations. arXiv preprint arXiv:2507.00990, 2025

[59]

Gen2real: Towards demo-free dexterous manipulation by harnessing generated video

Kai Ye, Yuhang Wu, Shuyuan Hu, Junliang Li, Meng Liu, Yongquan Chen, and Rui Huang. Gen2real: Towards demo-free dexterous manipulation by harnessing generated video. arXiv preprint arXiv:2509.14178, 2025

[60]

Dexman: Learning bimanual dexterous manipulation from human and generated videos

Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, and Tsung-Wei Ke. Dexman: Learning bimanual dexterous manipulation from human and generated videos. arXiv preprint arXiv:2510.08475, 2025

[61]

Geometry-aware 4D Video Generation for Robot Manipulation

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, and Shuran Song. Geometry-aware 4d video generation for robot manipulation. arXiv preprint arXiv:2507.01099, 2025

[62]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

[63]

Spatialtrackerv2: 3d point tracking made easy

Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. arXiv preprint arXiv:2507.12462, 2025

[64]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019

[65]

Robust motion in-betweening

Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020

[66]

Real-time style modelling of human locomotion via feature-wise transformations and local motion phases

Ian Mason, Sebastian Starke, and Taku Komura. Real-time style modelling of human locomotion via feature-wise transformations and local motion phases. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–18, 2022

[67]

Pink: Python inverse kinematics based on Pinocchio

Stéphane Caron, Yann De Mont-Marin, Rohan Budhiraja, Seung Hyeon Bang, Ivan Domrachev, Simeon Nedelchev, Peter Du, Adrien Escande, Joris Vaillant, Bruce Wingo, Santosh Patapati, Daniel San José Pro, and Nicolas Guillermo Marticorena Vidal. Pink: Python inverse kinematics based on Pinocchio, 2026

[68]

Object motion guided human motion synthesis

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023

[69]

Holosoma

Amazon FAR, Pieter Abbeel, Juyue Chen, Rocky Duan, Alejandro Escontrela, Manan Gandhi, Samuel Gundry, Xiaoyu Huang, Angjoo Kanazawa, Tomasz Lewicki, Jiaman Li, Karen Liu, Clay Rosenthal, Younggyo Seo, Carlo Sferrazza, Guanya Shi, Linda Shih, Jonathan Tseng, Zhen Wu, Lujie Yang, Brent Yi, and Yuanhang Zhang. Holosoma

[70]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

[71]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[72]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

Appendix A Training Details

In this section, we provide comprehensive training details for the three-stage progressive learning framework introduced in Sec...

