HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

Bin Yang; Haodong Chen; Lihan Chen; Lusong Li; Qiwei Wu; Renjing Xu; Taowen Wang; Weisheng Xu; Xingyu Chen; Yichi Wang

arxiv: 2606.17833 · v1 · pith:VOSXP7SLnew · submitted 2026-06-16 · 💻 cs.RO


Taowen Wang, Zikang Xie, Bin Yang, Yunheng Wang, Zizhao Yuan, Yuetong Fang, Yixiao Feng, Yichi Wang, Xingyu Chen, Haodong Chen, Qiwei Wu, Weisheng Xu, Lihan Chen, Lusong Li, Zecui Zeng, Renjing Xu

Pith reviewed 2026-06-27 01:13 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid robots, hierarchical control, whole-body learning, egocentric vision, benchmark, motion tracking, human-object interaction


The pith

HumanoidArena benchmark shows hierarchical policies solve leg-critical tasks only when matched to specific motion trackers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HumanoidArena to evaluate hierarchical learning in humanoid robots, where a high-level policy turns egocentric vision, proprioception, and instructions into whole-body actions that low-level trackers then execute. It designs seven tasks that require coordinated leg movements for success rather than treating legs as simple transport. Experiments demonstrate that this split allows policies to handle varied human-centered interactions. Yet success depends heavily on the tracker chosen and policies transfer poorly when the tracker changes. The work matters because it isolates the policy-tracker interface as a key barrier to scalable whole-body robot learning.

Core claim

HumanoidArena formulates policy learning as a hierarchical decision making problem in which a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action executed by a low-level general motion tracker, and through seven leg-critical human-object and human-scene interaction tasks demonstrates that hierarchical control enables learned policies to solve diverse interactions while performance remains strongly tracker-conditioned and cross-GMT transfer stays fragile.

What carries the argument

The policy-tracker interface, where the high-level policy outputs intermediate whole-body actions for execution by low-level general motion trackers.
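The interface described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual API: the names `Observation`, `HighLevelPolicy`, `MotionTracker`, `hierarchical_step`, and the two toy classes are all hypothetical, and the whole-body action is reduced to a plain list of floats.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    """Hypothetical container for the three policy inputs named in the paper."""
    egocentric_rgb: List[float]   # placeholder for an egocentric image
    proprioception: List[float]   # placeholder for joint state
    instruction: str              # natural-language task instruction


class HighLevelPolicy(Protocol):
    def act(self, obs: Observation) -> List[float]:
        """Return a compact intermediate whole-body action."""
        ...


class MotionTracker(Protocol):
    def track(self, whole_body_action: List[float]) -> List[float]:
        """Turn the intermediate action into low-level motion targets."""
        ...


class ConstantPolicy:
    """Toy policy (assumption): always emits the same whole-body action."""
    def __init__(self, action: List[float]) -> None:
        self.action = action

    def act(self, obs: Observation) -> List[float]:
        return list(self.action)


class ScalingTracker:
    """Toy tracker (assumption): scales the action to mimic a backend-specific mapping."""
    def __init__(self, gain: float) -> None:
        self.gain = gain

    def track(self, whole_body_action: List[float]) -> List[float]:
        return [self.gain * a for a in whole_body_action]


def hierarchical_step(policy: HighLevelPolicy, tracker: MotionTracker, obs: Observation):
    """One control step: the policy decides, the tracker executes."""
    whole_body_action = policy.act(obs)
    motion_targets = tracker.track(whole_body_action)
    return whole_body_action, motion_targets


obs = Observation([0.0], [0.1, 0.2], "step onto the box")
action, targets = hierarchical_step(ConstantPolicy([1.0, 2.0]), ScalingTracker(0.5), obs)
```

Because the policy only sees the tracker through `track`, swapping `ScalingTracker` for another backend changes the executed motion without changing the policy, which is the situation the cross-GMT transfer evaluation probes.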

If this is right

Hierarchical policies can complete tasks that require foot placement, balance maintenance, posture adjustment, and whole-body reorientation.
Success rates change sharply depending on which general motion tracker executes the actions.
Policies trained with one tracker show limited ability to work with a different tracker.
The benchmark supports separate diagnosis of generalization under perturbations and transfer across trackers.
Lower-body dynamics play a structural role in the chosen human-centered interaction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robust intermediate action representations that work across trackers would reduce the need to retrain high-level policies for each new execution backend.
The observed transfer fragility suggests testing whether actions can be learned in a tracker-invariant space.
Extending the benchmark to physical robots would reveal whether the same tracker dependence appears outside simulation.
Similar hierarchical splits could be applied to other robot platforms that combine high-level planning with modular low-level controllers.

Load-bearing premise

The seven tasks are built so that lower-body coordination is required to complete them rather than being incidental.

What would settle it

A single high-level policy achieving comparable success rates when tested on multiple different general motion trackers without retraining would falsify the fragility of cross-GMT transfer.
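One way to make this test concrete is a train-tracker by eval-tracker success matrix. The sketch below is an assumption about how such a check could be scored, not the paper's protocol: the function names, the `tolerance` threshold, the tracker labels, and the numbers are all placeholders rather than reported results.

```python
from typing import Dict, Tuple

# success[(train_gmt, eval_gmt)] -> success rate in [0, 1]
SuccessMatrix = Dict[Tuple[str, str], float]


def transfer_gaps(success: SuccessMatrix) -> Dict[Tuple[str, str], float]:
    """For each mismatched pair, return matched success minus cross-tracker success."""
    gaps = {}
    for (train_gmt, eval_gmt), rate in success.items():
        if train_gmt == eval_gmt:
            continue  # the diagonal is the matched baseline, not a transfer case
        gaps[(train_gmt, eval_gmt)] = success[(train_gmt, train_gmt)] - rate
    return gaps


def is_transfer_robust(success: SuccessMatrix, tolerance: float = 0.05) -> bool:
    """True if no cross-tracker pair loses more than `tolerance` success rate."""
    return all(gap <= tolerance for gap in transfer_gaps(success).values())


# Placeholder numbers for illustration only; these are not results from the paper.
fragile = {
    ("tracker_a", "tracker_a"): 0.8,
    ("tracker_a", "tracker_b"): 0.3,
    ("tracker_b", "tracker_b"): 0.7,
    ("tracker_b", "tracker_a"): 0.6,
}
print(transfer_gaps(fragile))
print(is_transfer_robust(fragile))
```

Under this scoring, the fragility claim would be falsified by a matrix whose off-diagonal entries stay within the tolerance of the diagonal without any retraining.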

Figures

Figures reproduced from arXiv: 2606.17833 by Bin Yang, Haodong Chen, Lihan Chen, Lusong Li, Qiwei Wu, Renjing Xu, Taowen Wang, Weisheng Xu, Xingyu Chen, Yichi Wang, Yixiao Feng, Yuetong Fang, Yunheng Wang, Zecui Zeng, Zikang Xie, Zizhao Yuan.

**Figure 1.** Overview of HUMANOIDARENA. The benchmark studies leg-critical HOI/HSI tasks where success requires coordinated perception, foot placement, balance, and whole-body motion. Within this hierarchical formulation, high-level policies predict intermediate whole-body actions from egocentric visual observations, task instructions, and proprioception, while low-level GMTs stabilize and track them into feasible huma…

**Figure 2.** Qualitative rollouts. We visualize four representative successful episodes generated by Diffusion Policy with SONIC as the low-level GMT. The examples cover both HOI and HSI tasks, showing that the high-level policy can coordinate egocentric perception, task interaction, foot placement, and whole-body motion through the shared GMT-based execution interface.

**Figure 3.** Perturbation-conditioned evaluation. We report suite-level average success rates under the in-distribution, visual, semantic, and execution settings. Each curve denotes one high-level policy, characterizing how different policy architectures degrade under the three benchmark-defined sources of perturbation with matched GMT execution.
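The suite-level averages in Figure 3 can be illustrated with a small aggregation sketch. The four setting names follow the caption; everything else (function names, task names, and the numbers) is a hypothetical placeholder and not data from the paper.

```python
from statistics import mean
from typing import Dict

SETTINGS = ("in_distribution", "visual", "semantic", "execution")


def suite_average(per_task: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average success rate over tasks, separately for each evaluation setting."""
    return {s: mean(task[s] for task in per_task.values()) for s in SETTINGS}


def degradation(avg: Dict[str, float]) -> Dict[str, float]:
    """Drop in suite-level success relative to the in-distribution setting."""
    base = avg["in_distribution"]
    return {s: base - avg[s] for s in SETTINGS if s != "in_distribution"}


# Placeholder per-task success rates for one policy (illustration only).
per_task = {
    "task_1": {"in_distribution": 1.0, "visual": 0.5, "semantic": 0.75, "execution": 0.25},
    "task_2": {"in_distribution": 0.5, "visual": 0.5, "semantic": 0.25, "execution": 0.25},
}
avg = suite_average(per_task)
print(avg)
print(degradation(avg))
```

Running the same aggregation once per high-level policy would give one curve per policy, matching the layout the caption describes.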

Original abstract

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary in task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HumanoidArena introduces a benchmark with seven leg-critical tasks and dual evaluation axes for hierarchical humanoid control, but the core claim needs stronger evidence that lower-body coordination is actually required.


The paper's main move is to create HumanoidArena as a simulation benchmark that isolates how high-level policies interface with low-level general motion trackers on tasks where legs matter. It defines seven HOI/HSI scenarios that supposedly need foot placement, balance, and reorientation, then tests policies under perturbations and across different GMT backends. The reported outcome is that hierarchical setups can solve the tasks but stay heavily dependent on the tracker and transfer poorly between them.

What stands out is the explicit focus on the policy-tracker boundary and the two diagnostic axes. Most prior humanoid benchmarks either ignore the tracker or treat legs as simple transport. This one tries to make lower-body work load-bearing, which could help organize work on transferable intermediate actions.

The soft spot is task construction. The abstract claims success requires lower-body coordination, yet there is no mention of ablations that would show leg-agnostic or upper-body-only policies actually fail. If rewards or termination conditions can be met by treating legs as passive or by letting the GMT handle everything, then the advantage of the hierarchical approach is not secured. The stress-test concern lands here: without those controls, the central experimental claim rests on an assumption that is not yet defended.

The paper is aimed at researchers building scalable humanoid policies who need a testbed for the interface layer. A reader already working on whole-body learning or tracker-conditioned transfer would get concrete tasks and evaluation perspectives to compare against. It is not yet a finished story on what makes a good intermediate representation.

I would send it to peer review. The benchmark idea is timely and the evaluation framing is useful, but referees will need to see the task definitions, reward details, and ablations before the results can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces HumanoidArena, a simulation benchmark for egocentric hierarchical whole-body policy learning on humanoid robots. A high-level policy maps egocentric vision, proprioception and instructions to compact whole-body actions that are executed by low-level general motion trackers (GMTs). The benchmark defines seven leg-critical human-object and human-scene interaction tasks in which success is claimed to require foot placement, balance, posture adjustment and whole-body reorientation. Experiments are reported to show that hierarchical control enables solution of these tasks, yet performance is strongly conditioned on the choice of GMT and cross-GMT transfer remains fragile.

Significance. If the seven tasks are shown to make lower-body coordination structurally necessary rather than incidental, the benchmark would supply a useful testbed for studying the policy-tracker interface, perturbation robustness and transfer of intermediate whole-body action representations. The dual evaluation axes (perturbation-conditioned generalization and GMT-conditioned transfer) directly target practical deployment questions in hierarchical humanoid control.

major comments (2)

[Abstract] Abstract (task design paragraph): the central claim that the seven tasks make lower-body coordination structurally necessary rests on the assertion that success requires foot placement, balance, posture adjustment and reorientation, yet no ablation is described that demonstrates failure of leg-agnostic or upper-body-only policies; without such evidence the reported advantage of hierarchical whole-body policies over non-hierarchical baselines is not secured.
[Abstract] Abstract (experimental outcomes): the statements that hierarchical control enables solution of the tasks and that performance is strongly tracker-conditioned are presented without accompanying task definitions, reward/termination conditions, success metrics, baseline implementations or quantitative tables, preventing verification of the tracker-conditioned and cross-GMT transfer results.

minor comments (1)

[Abstract] The abstract refers to 'perturbation-conditioned generalization' and 'GMT-conditioned transfer' without defining the perturbation distributions or naming the specific GMT backends used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the abstract. We address each point below and will revise the abstract for improved clarity and support of the claims.


Referee: [Abstract] Abstract (task design paragraph): the central claim that the seven tasks make lower-body coordination structurally necessary rests on the assertion that success requires foot placement, balance, posture adjustment and reorientation, yet no ablation is described that demonstrates failure of leg-agnostic or upper-body-only policies; without such evidence the reported advantage of hierarchical whole-body policies over non-hierarchical baselines is not secured.

Authors: The seven tasks were deliberately constructed so that lower-body coordination is structurally required for success (e.g., precise foot placement on narrow surfaces or dynamic balance during object carrying), as explained in the task design rationale. Upper-body-only or leg-agnostic control would fail by construction on these interactions. While the manuscript does not present an explicit ablation with leg-agnostic policies, the reported results with whole-body actions and the performance gaps across GMTs illustrate the necessity. We will revise the abstract to explicitly state the design intent and reference the task descriptions that establish this requirement. revision: partial
Referee: [Abstract] Abstract (experimental outcomes): the statements that hierarchical control enables solution of the tasks and that performance is strongly tracker-conditioned are presented without accompanying task definitions, reward/termination conditions, success metrics, baseline implementations or quantitative tables, preventing verification of the tracker-conditioned and cross-GMT transfer results.

Authors: The abstract is a concise summary; complete task definitions, reward and termination conditions, success metrics, baseline details, and quantitative tables appear in Sections 3–5 of the manuscript. To facilitate verification from the abstract alone, we will add brief section references for the key experimental claims regarding hierarchical control and GMT conditioning. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or fitted predictions


The paper introduces HumanoidArena as a benchmark for egocentric hierarchical whole-body learning and reports experimental results on 7 leg-critical tasks. It contains no mathematical derivations, first-principles predictions, parameter fittings, or equations that could reduce outputs to inputs by construction. All load-bearing elements are empirical evaluations of policy performance under different trackers and perturbations, with no self-citation chains, uniqueness theorems, or ansatzes invoked to justify results. The work is self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only abstract available; ledger reflects standard simulation assumptions and the benchmark itself as the primary invented element.

axioms (1)

domain assumption: Physics simulation accurately captures humanoid dynamics, contacts, and balance for the chosen tasks.
The benchmark is simulation-first and relies on sim fidelity for all reported results.

invented entities (1)

HumanoidArena benchmark and its 7 leg-critical tasks (no independent evidence)
purpose: To diagnose policy-tracker interface and cross-GMT transfer in whole-body learning
Newly defined tasks and evaluation protocol introduced in the paper.

pith-pipeline@v0.9.1-grok · 5859 in / 1314 out tokens · 38279 ms · 2026-06-27T01:13:03.098299+00:00 · methodology



[1] M. Merleau-Ponty and C. Smith. Phenomenology of Perception. Motilal Banarsidass Publishers (Pvt. Limited), 1996

[2] Rodney A. Brooks. Intelligence without representation. Artificial Intelligence, 1991

[3] Mike Stilman, Koichi Nishiwaki, and Satoshi Kagami. Humanoid teleoperation for whole body manipulation. In ICRA, 2008

[4] Júlia Borràs and Tamim Asfour. A whole-body pose taxonomy for loco-manipulation tasks. In IROS, 2015

[5] Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, and Guanya Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. In ICRA, 2026

[6] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

[7] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc... arXiv, 2025

[8] Qiayuan Liao, Takara E. Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, and C. Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2025

[9] Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, and Xuelong Li. Kungfubot: Physics-based humanoid whole-body control for learning highly-dynamic skills. arXiv preprint arXiv:2506.12851, 2025

[10] Yunshen Wang, Shaohang Zhu, Peiyuan Zhi, Yuhan Li, Jiaxin Li, Yong-Lu Li, Yuchen Xiao, Xingxing Wang, Baoxiong Jia, and Siyuan Huang. Omnixtreme: Breaking the generality barrier in high-dynamic humanoid control. arXiv preprint arXiv:2602.23843, 2026

[11] Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C. Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832, 2025

[12] Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025

[13] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024

[14] Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, and Shankar Sastry. Leverb: Humanoid whole-body control with latent vision-language instruction. arXiv preprint arXiv:2506.13751, 2025

[15] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, Mona Anvari, Minjune Hwang, Manasi Sharma, Arman Aydin, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R Matthews, Ivan Villa-Renteria, Jerry Huayang Tang, Claire Tang, Fei Xia, Silvio Sa... BEHAVIOR-1k: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation, 2022

[16] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023

[17] Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, and Wei Pan. Humanoidverse: A versatile humanoid for vision-language guided multi-object rearrangement. arXiv preprint arXiv:2508.16943v1, 2025

[18] Haoyang Weng, Yitang Li, Nikhil Sobanbabu, Zihan Wang, Zhengyi Luo, Tairan He, Deva Ramanan, and Guanya Shi. Hdmi: Learning interactive humanoid whole-body control from human videos. arXiv preprint arXiv:2509.16757, 2025

[19] Siheng Zhao, Yanjie Ze, Yue Wang, C. Karen Liu, Pieter Abbeel, Guanya Shi, and Rocky Duan. Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning. arXiv preprint arXiv:2510.05070, 2025

[20] Dongting Li, Xingyu Chen, Qianyang Wu, Bo Chen, Sikai Wu, Hanyu Wu, Guoyao Zhang, Liang Li, Mingliang Zhou, Diyun Xiang, et al. Haic: Humanoid agile object interaction control via dynamics-aware world model. arXiv preprint arXiv:2602.11758, 2026

[21] Yinhuai Wang, Qihan Zhao, Yuen Fui Lau, Runyi Yu, Hok Wai Tsui, Qifeng Chen, Jingbo Wang, Jiangmiao Pang, and Ping Tan. Humanx: Toward agile and generalizable humanoid interaction skills from human videos. arXiv preprint arXiv:2602.02473, 2026

[22] Zhi Su, Bike Zhang, Nima Rahmanian, Yuman Gao, Qiayuan Liao, Caitlin Regan, Koushil Sreenath, and S Shankar Sastry. Hitter: A humanoid table tennis robot via hierarchical planning and learning. arXiv preprint arXiv:2508.21043, 2025

[23] Zhikai Zhang, Haofei Lu, Yunrui Lian, Ziqing Chen, Yun Liu, Chenghuai Lin, Han Xue, Zicheng Zeng, Zekun Qi, Shaolin Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data. arXiv preprint arXiv:2603.12686, 2026

[24] Huayi Wang, Wentao Zhang, Runyi Yu, Tao Huang, Junli Ren, Feiyu Jia, Zirui Wang, Xiaojie Niu, Xiao Chen, Jiahe Chen, et al. Physhsi: Towards a real-world generalizable and natural humanoid-scene interaction system. arXiv preprint arXiv:2510.11072, 2025

[25] Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C Karen Liu, and Jiajun Wu. Visualmimic: Visual humanoid loco-manipulation via motion tracking and generation. arXiv preprint arXiv:2509.20322, 2025

[26] Tairan He, Zi Wang, Haoru Xue, Qingwei Ben, Zhengyi Luo, Wenli Xiao, Ye Yuan, Xingye Da, Fernando Castañeda, Shankar Sastry, et al. Viral: Visual sim-to-real at scale for humanoid loco-manipulation. arXiv preprint arXiv:2511.15200, 2025

[27] Yuhang Lin, Jiyuan Shi, Dewei Wang, Jipeng Kong, Yong Liu, Chenjia Bai, and Xuelong Li. Pro-hoi: Perceptive root-guided humanoid-object interaction. arXiv preprint arXiv:2603.01126, 2026

[28] Yutang Lin, Jieming Cui, Yixuan Li, Baoxiong Jia, Yixin Zhu, and Siyuan Huang. Lessmimic: Long-horizon humanoid interaction with unified distance field representations. arXiv preprint arXiv:2602.21723, 2026

[29] Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, and Liang-Yan Gui. Ultra: Unified multimodal control for autonomous humanoid whole-body loco-manipulation. arXiv preprint arXiv:2603.03279, 2026

[30] Junli Ren, Yinghui Li, Kai Zhang, Penglin Fu, Haoran Jiang, Yixuan Pan, Guangjun Zeng, Tao Huang, Weizhong Guo, Peng Lu, et al. Smash: Mastering scalable whole-body skills for humanoid ping-pong with egocentric vision. arXiv preprint arXiv:2604.01158, 2026

[31] Nikita Rudin, Junzhe He, Joshua Aurand, and Marco Hutter. Parkour in the wild: Learning a general and extensible agile locomotion policy using multi-expert distillation and rl fine-tuning. arXiv preprint arXiv:2505.11164, 2025

[32] Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, and Koushil Sreenath. Ego-vision world model for humanoid contact planning. arXiv preprint arXiv:2510.11682, 2025

[33] Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, and Jiangmiao Pang. Gallant: Voxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains. arXiv preprint arXiv:2511.14625, 2025

[34] Shaoting Zhu, Ziwen Zhuang, Mengjie Zhao, Kun-Ying Lee, and Hang Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids. arXiv preprint arXiv:2601.07718, 2026

[35] Ziwen Zhuang, Shaoting Zhu, Mengjie Zhao, and Hang Zhao. Deep whole-body parkour. arXiv preprint arXiv:2601.07701, 2026

[36] Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Koushil Sreenath, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, et al. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026

[37] Haoran Yang, Jiacheng Bao, Yucheng Xin, Haoming Song, Yuyang Tian, Bin Zhao, Dong Wang, and Xuelong Li. Zerowbc: Learning natural visuomotor humanoid control directly from human egocentric video. arXiv preprint arXiv:2603.09170, 2026

[38] Weikai Qin, Sichen Wu, Ci Chen, Mengfan Liu, Linxi Feng, Xinru Cui, Haoqi Han, and Hesheng Wang. Physiflow: Physics-aware humanoid whole-body vla via multi-brain latent flow matching and robust tracking. arXiv preprint arXiv:2603.05410, 2026

[39] Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, and Donglin Wang. Humanoid-vla: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025

[40] Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, and Hongyang Li. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control. arXiv preprint arXiv:2512.11047, 2025

[41] Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, and Yue Wang. ψ0: An open foundation model towards universal humanoid loco-manipulation. arXiv preprint arXiv:2603.12263, 2026

[42] Yunfan Jiang, Ruohan Zhang, Josiah Wong, Chen Wang, Yanjie Ze, Hang Yin, Cem Gokmen, Shuran Song, Jiajun Wu, and Li Fei-Fei. BEHAVIOR robot suite: Streamlining real-world whole-body manipulation for everyday household activities. In CoRL, 2025

[43] Yizheng Zhang, Zhenjun Yu, Jiaxin Lai, Cewu Lu, and Lei Han. Agentworld: An interactive simulation platform for scene construction and mobile robotic manipulation. In CoRL, 2025

[44] Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharov, Vitor Guizilini, and Yue Wang. Humanoid everyday: A comprehensive robotic dataset for open-world humanoid manipulation. arXiv preprint arXiv:2510.08807, 2025

[45] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In RSS, 2023

[46] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023

[47] Boston Dynamics and TRI Research Team. Large behavior models and atlas find new footing

[48] Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo, M. G... arXiv, 2025

[49] Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025

[50] Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025

SUMMARY OF THE APPENDIX

This appendix contains additional experimental results and discussions of our work, organized as:

• §A details the leg-critical HOI/HSI task suite...