HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
Pith reviewed 2026-05-21 09:45 UTC · model grok-4.3
The pith
HEX uses a humanoid-aligned state representation and mixture-of-experts predictor to achieve coordinated whole-body control on full-sized bipedal robots from multi-embodiment data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HEX is a state-centric framework built from four components. A humanoid-aligned universal state representation supports scalable cross-embodiment learning. A Mixture-of-Experts Unified Proprioceptive Predictor captures whole-body coordination and temporal motion dynamics. Lightweight history tokens summarize past visual observations. A residual-gated fusion mechanism with a flow-matching action head integrates visual-language cues with proprioceptive dynamics. The paper reports that this combination achieves state-of-the-art success rates on real-world humanoid manipulation tasks.
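To make the flow-matching action head concrete, here is a minimal sketch of how such a head typically produces an action at inference time: it starts from noise and integrates a learned velocity field toward an action. The paper summary does not state the integrator, step count, or conditioning, so `sample_action`, `make_toy_velocity`, and the Euler scheme are assumptions for illustration, and the toy velocity function merely stands in for a trained network.

```python
def sample_action(velocity_fn, noise, num_steps=10):
    """Integrate da/dt = v(a, t) from t=0 (noise) to t=1 (action) with Euler steps.

    velocity_fn is a hypothetical stand-in for the trained action head.
    """
    action = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(action, t)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


def make_toy_velocity(target, noise):
    """Toy field: the exact constant velocity of the straight path noise -> target."""
    delta = [x - e for x, e in zip(target, noise)]
    return lambda action, t: delta


if __name__ == "__main__":
    noise = [0.5, -1.0]
    target = [0.2, 0.3]
    print(sample_action(make_toy_velocity(target, noise), noise))
```

With a straight-line path the velocity is constant, so Euler integration recovers the target up to rounding; a learned field would only approximate this.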
What carries the argument
Humanoid-aligned universal state representation paired with the Mixture-of-Experts Unified Proprioceptive Predictor that models whole-body coordination and temporal dynamics from multi-embodiment trajectories.
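As a rough illustration of what a mixture-of-experts layer does, the sketch below routes an input vector to the top-k experts chosen by a softmax gate and mixes their outputs. This is generic top-k MoE routing, not the paper's predictor: the function names, the bias-free linear experts, and `top_k=2` are all assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def linear(weights, x):
    """Apply a bias-free dense layer given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_forward(x, gate_w, experts, top_k=2):
    """Route x to the top_k experts and mix outputs by renormalized gate weights."""
    probs = softmax(linear(gate_w, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(experts[0])  # output width = number of rows in an expert
    for i in chosen:
        y = linear(experts[i], x)
        for j, v in enumerate(y):
            out[j] += (probs[i] / norm) * v
    return out, chosen


if __name__ == "__main__":
    gate_w = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]  # hypothetical gate weights
    experts = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 0.0], [0.0, 2.0]],
    ]
    print(moe_forward([1.0, 0.0], gate_w, experts))
```

Only the selected experts are evaluated, which is the usual reason to prefer sparse routing over averaging all experts.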
If this is right
- Whole-body humanoid tasks become more reliable without separate policies for each limb.
- Training data collected from multiple robot platforms can be reused for new humanoids.
- Long-horizon and fast-reaction tasks improve because temporal proprioceptive dynamics are explicitly modeled.
- Inference cost stays low because history tokens replace repeated image encoding.
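The inference-cost point can be sketched as a small buffer: each frame is encoded once, reduced to a compact token, and later steps reuse the stored tokens instead of re-encoding old images. The class name, the buffer length, and the toy encoder and summarizer below are hypothetical; the paper summary does not say how the tokens are computed.

```python
from collections import deque


class HistoryTokenBuffer:
    """Keep one compact token per past frame so old images are never re-encoded."""

    def __init__(self, max_len, encode_fn, summarize_fn):
        self.tokens = deque(maxlen=max_len)  # oldest token is dropped automatically
        self.encode_fn = encode_fn
        self.summarize_fn = summarize_fn
        self.encoder_calls = 0

    def step(self, image):
        """Encode only the current frame; return its features plus prior tokens."""
        features = self.encode_fn(image)
        self.encoder_calls += 1
        context = list(self.tokens)  # tokens from earlier frames only
        self.tokens.append(self.summarize_fn(features))
        return features, context


def toy_encode(image):
    """Stand-in for a vision encoder (assumption)."""
    return [float(p) for p in image]


def toy_summarize(features):
    """Stand-in for a learned summarizer: mean pooling to a single number."""
    return sum(features) / len(features)


if __name__ == "__main__":
    buf = HistoryTokenBuffer(2, toy_encode, toy_summarize)
    for frame in ([1, 3], [2, 4], [5, 7]):
        print(buf.step(frame))
```

The encoder runs once per step regardless of how long the history is, which is the property the bullet above describes.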
Where Pith is reading between the lines
- The same state representation might transfer to non-humanoid high-DoF platforms if the kinematic alignment step is generalized.
- Reducing embodiment-specific retraining could lower the data and compute cost of deploying new robots.
- Extending the flow-matching head to include force or contact predictions could further stabilize contact-rich manipulation.
Load-bearing premise
A single humanoid-aligned universal state representation combined with large-scale multi-embodiment trajectory data is sufficient to produce stable whole-body coordination across heterogeneous robot embodiments.
What would settle it
An experiment in which HEX, trained on the given multi-embodiment dataset, is tested on a new humanoid embodiment or a long-horizon task and achieves a lower success rate than a baseline that treats body parts independently.
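Such a comparison only settles anything if the gap between success rates exceeds trial-to-trial noise. The sketch below computes a Wilson score interval for a success rate measured over a small number of real-robot trials. The trial counts in the usage lines are hypothetical placeholders, not results from the paper, and checking whether two intervals overlap is a rough screen rather than a formal significance test.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    method = wilson_interval(15, 20)
    baseline = wilson_interval(9, 20)
    print(method, baseline, intervals_overlap(method, baseline))
```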
Original abstract
Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HEX, a state-centric framework for whole-body manipulation on full-sized bipedal humanoid robots. It proposes a humanoid-aligned universal state representation to enable scalable cross-embodiment learning, a Mixture-of-Experts Unified Proprioceptive Predictor trained on large-scale multi-embodiment trajectories to capture coordination and temporal dynamics, lightweight history tokens to summarize visual context without repeated encoding, and a residual-gated fusion mechanism paired with a flow-matching action head. The central claim is that these components yield state-of-the-art task success rates and generalization on real-world humanoid tasks, with particular gains in fast-reaction and long-horizon scenarios.
Significance. If the empirical results hold, the work would address a genuine limitation in current VLA models by moving beyond independent body-part control toward coordinated high-DoF humanoid behavior. The combination of a universal state representation with MoE proprioceptive modeling could offer a practical route to cross-embodiment transfer, which remains an open challenge in robotics. The absence of any quantitative metrics, baselines, or ablations in the manuscript, however, prevents assessment of whether these architectural choices deliver the claimed coordination and generalization benefits.
major comments (2)
- [Abstract] Abstract: The assertion of state-of-the-art performance in task success rate and generalization is presented without any numerical results, baseline comparisons, error bars, dataset sizes, or ablation studies. This omission is load-bearing because the paper's primary contribution is framed as an empirical advance over independent-part VLA baselines; without verifiable evidence, the effectiveness of the humanoid-aligned state representation and MoE predictor cannot be evaluated.
- [Method] Method (universal state representation and MoE predictor): The manuscript describes the humanoid-aligned universal state representation and Mixture-of-Experts Unified Proprioceptive Predictor as enabling stable whole-body coordination across heterogeneous embodiments, yet provides no explicit alignment mechanism, invariance argument, or analysis of how differing kinematics, joint limits, and sensor frames are reconciled without embodiment-specific adapters. This is central to the cross-embodiment claim; if the representation does not implicitly achieve the required alignment, the reported gains in fast-reaction and long-horizon tasks would not follow.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly indicated the number of real-world tasks, robot platforms, and trajectory hours used in training and evaluation.
- [Method] Notation for the residual-gated fusion and flow-matching action head could be introduced with a short equation or diagram to improve readability for readers unfamiliar with the specific implementation.
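For readers unfamiliar with these components, one common way to write them is given below. This is a generic formulation offered as an assumption, not notation taken from the manuscript: h_vl denotes visual-language features, h_p proprioceptive features, g a learned gate, a a ground-truth action chunk, and v_theta the velocity network of the action head.

```latex
% Residual-gated fusion (assumed form): the gate g scales how much of the
% projected proprioceptive feature is added to the visual-language feature.
g = \sigma\!\left(W_g\,[\,h_{\mathrm{vl}};\,h_{\mathrm{p}}\,] + b_g\right), \qquad
h = h_{\mathrm{vl}} + g \odot W_p\, h_{\mathrm{p}}

% Flow matching with a linear path (assumed form): interpolate noise and action.
a_t = (1 - t)\,\epsilon + t\,a, \qquad
\epsilon \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

% Training loss: regress the path velocity da_t/dt = a - \epsilon.
\mathcal{L}_{\mathrm{FM}}(\theta) =
  \mathbb{E}_{a,\,\epsilon,\,t}\,
  \bigl\| v_\theta(a_t, t, h) - (a - \epsilon) \bigr\|^2
```

When g is near zero the fused feature reduces to h_vl, so the residual form lets the model fall back to visual-language cues alone.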
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We appreciate the recognition of the potential impact of our state-centric approach for coordinated humanoid manipulation and the identification of areas where the presentation can be strengthened. We address each major comment below and commit to revisions that directly respond to the concerns.
- Referee: [Abstract] Abstract: The assertion of state-of-the-art performance in task success rate and generalization is presented without any numerical results, baseline comparisons, error bars, dataset sizes, or ablation studies. This omission is load-bearing because the paper's primary contribution is framed as an empirical advance over independent-part VLA baselines; without verifiable evidence, the effectiveness of the humanoid-aligned state representation and MoE predictor cannot be evaluated.
  Authors: We agree that the abstract would be strengthened by including concrete quantitative results. The manuscript body reports extensive real-world experiments with baselines, ablations, success rates, and generalization metrics; however, we will revise the abstract to explicitly state key numerical outcomes (including task success rates, improvements over baselines, dataset sizes, and evaluation details) so that the empirical claims are immediately verifiable.
  revision: yes
- Referee: [Method] Method (universal state representation and MoE predictor): The manuscript describes the humanoid-aligned universal state representation and Mixture-of-Experts Unified Proprioceptive Predictor as enabling stable whole-body coordination across heterogeneous embodiments, yet provides no explicit alignment mechanism, invariance argument, or analysis of how differing kinematics, joint limits, and sensor frames are reconciled without embodiment-specific adapters. This is central to the cross-embodiment claim; if the representation does not implicitly achieve the required alignment, the reported gains in fast-reaction and long-horizon tasks would not follow.
  Authors: We thank the referee for highlighting this point. The alignment is realized by mapping embodiment-specific proprioception to a canonical humanoid kinematic model through forward kinematics, joint-angle normalization, and sensor-frame calibration, allowing the MoE predictor to operate on a shared representation trained across multi-embodiment trajectories. We acknowledge that the current text would benefit from greater explicitness. In the revision we will add a dedicated paragraph in the Method section that formalizes the state-mapping procedure, provides an invariance argument grounded in the training data distribution, and analyzes robustness to kinematic and sensor variations.
  revision: yes
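The mapping described in this response can be sketched in part. The example below covers only joint-angle normalization and placement into fixed canonical body-part slots with an availability mask; forward kinematics and sensor-frame calibration are omitted. The slot names, `SLOT_DIM`, and both function names are hypothetical, since the manuscript's actual procedure is not given here.

```python
# Hypothetical slot layout and per-slot width; the paper's real layout is not stated.
CANONICAL_SLOTS = ("left_arm", "right_arm", "torso", "left_leg", "right_leg")
SLOT_DIM = 3


def normalize_joint(angle, lower, upper):
    """Map a joint angle from [lower, upper] to [-1, 1], clipping out-of-range values."""
    if upper <= lower:
        raise ValueError("upper limit must exceed lower limit")
    clipped = min(max(angle, lower), upper)
    return 2.0 * (clipped - lower) / (upper - lower) - 1.0


def to_canonical_state(joint_angles, joint_limits, slot_map):
    """Fill canonical slots with normalized joints; absent parts get zeros and mask=False."""
    state, mask = [], []
    for slot in CANONICAL_SLOTS:
        names = slot_map.get(slot, [])
        values = [normalize_joint(joint_angles[n], *joint_limits[n]) for n in names]
        values = values[:SLOT_DIM]  # truncate if a robot has more joints than the slot holds
        values += [0.0] * (SLOT_DIM - len(values))  # zero-pad shorter parts
        state.append(values)
        mask.append(bool(names))
    return state, mask


if __name__ == "__main__":
    # A single-arm robot: only the right_arm slot is populated.
    angles = {"shoulder": 0.0, "elbow": 3.0}
    limits = {"shoulder": (-2.0, 2.0), "elbow": (0.0, 3.0)}
    print(to_canonical_state(angles, limits, {"right_arm": ["shoulder", "elbow"]}))
```

The mask lets a downstream model distinguish a missing body part from a joint that happens to sit at its mid-range value, which zero-padding alone would conflate.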
Circularity Check
No significant circularity; claims rest on empirical evaluation of proposed components.
The paper introduces HEX as a new state-centric framework featuring a humanoid-aligned universal state representation and a Mixture-of-Experts Unified Proprioceptive Predictor trained on multi-embodiment trajectory data. Performance claims are explicitly tied to real-world experiments demonstrating SOTA task success rates and generalization, rather than any derivation that reduces by construction to fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are presented in the abstract or method sketch that equate outputs to inputs via self-definition or self-citation chains. The central inductive bias (cross-embodiment coordination via the universal representation) is treated as a modeling choice whose effectiveness is tested externally through hardware trials, satisfying the criteria for a self-contained empirical contribution without load-bearing circular steps.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  We organize the input state using a fixed set of canonical body-part slots... map each available part into a shared latent space... MoE modules at the input and output boundaries of the predictor.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
  Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...
- Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking
  Any2Any transfers pretrained humanoid whole-body tracking policies to new embodiments with 1% of original training cost via kinematic alignment and parameter-efficient fine-tuning.
- RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
  RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.