VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

Diyun Xiang; Dongting Li; Guoyao Zhang; Jianzhu Ma; Liang Li; Mingliang Zhou; Qiang Zhang; Qianyang Wu; Renjing Xu; Sikai Wu

arxiv: 2606.09286 · v1 · pith:RSO4O4QHnew · submitted 2026-06-08 · 💻 cs.RO

Dongting Li, Qianyang Wu, Xingyu Chen, Liang Li, Yuhang Lin, Sikai Wu, Guoyao Zhang, Mingliang Zhou, Diyun Xiang, Qiang Zhang, Renjing Xu, Jianzhu Ma

Pith reviewed 2026-06-27 16:36 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid robot · agile object interaction · policy distillation · vision-guided control · recurrent adaptation · decoupled commands · real-world deployment

The pith

VAIC distills a teacher policy into a single student policy that performs diverse agile object interactions on humanoids using only depth, proprioception, and decoupled commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the gap between simulation-trained controllers and real-world humanoid deployment by removing the need for dense reference trajectories and perfect state observability. It introduces a two-stage process in which a privileged teacher first masters interaction skills with full kinematic information, then transfers the behavior to a deployable student that receives only onboard depth images, past joint states, velocity targets on multiple axes, and a per-frame interaction flag. A recurrent module inside the student learns to recover hidden object properties on the fly. If this works, one policy can be deployed across unstructured settings for tasks that previously required separate controllers or privileged sensing. The result is framed as progress toward autonomous humanoids that assist in everyday environments.

Core claim

VAIC is a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. It employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks.
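To make the student interface concrete, here is a minimal sketch of how the four student inputs could be packed into one vector and how a distillation target could be scored. It is an illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, the binary encoding of the interaction indicator, and the choice of a mean-squared-error imitation loss are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def build_student_input(depth, proprio_history, velocity_cmd, interact_flag):
    """Pack the student's inputs into one flat vector.

    All shapes are assumptions made for illustration:
      depth            (H, W)  onboard depth image
      proprio_history  (T, D)  past joint states
      velocity_cmd     (K,)    one velocity target per commanded axis
      interact_flag    scalar  per-frame interaction indicator (0.0 or 1.0)
    """
    parts = [
        np.asarray(depth, dtype=np.float32).ravel(),
        np.asarray(proprio_history, dtype=np.float32).ravel(),
        np.asarray(velocity_cmd, dtype=np.float32).ravel(),
        np.asarray([interact_flag], dtype=np.float32),
    ]
    return np.concatenate(parts)


def distillation_loss(student_action, teacher_action):
    """Mean squared error between student and teacher actions.

    A plain behavior-cloning loss, assumed here as a stand-in for whatever
    distillation objective the paper actually uses.
    """
    diff = np.asarray(student_action, dtype=np.float64) - np.asarray(
        teacher_action, dtype=np.float64
    )
    return float(np.mean(diff ** 2))
```

Note that the teacher's privileged object state never appears in `build_student_input`; in this sketch the teacher only contributes the action target passed to `distillation_loss`.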

What carries the argument

The recurrent object adaptation module, which replaces explicit state estimation by learning to recover hidden object dynamics directly from depth images and proprioceptive history inside the student policy.
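A rough sketch of what such a recurrent module could look like follows. The abstract does not say which recurrent cell is used, so the gated recurrent unit (GRU) below, the class name `GRUAdapter`, the bias-free weights, and all dimensions are assumptions chosen only to show how a hidden state can accumulate evidence about an object across frames.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUAdapter:
    """Minimal bias-free GRU cell (hypothetical stand-in for the module)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        # One weight matrix per gate, applied to [input, hidden] concatenated.
        shape = (hidden_dim, input_dim + hidden_dim)
        self.W_z = rng.uniform(-scale, scale, shape)  # update gate
        self.W_r = rng.uniform(-scale, scale, shape)  # reset gate
        self.W_h = rng.uniform(-scale, scale, shape)  # candidate state
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, x, h):
        """Update the hidden state h with one frame of features x."""
        xh = np.concatenate([x, h])
        z = sigmoid(self.W_z @ xh)
        r = sigmoid(self.W_r @ xh)
        h_candidate = np.tanh(self.W_h @ np.concatenate([x, r * h]))
        # Blend the old state with the candidate, weighted by the update gate.
        return (1.0 - z) * h + z * h_candidate
```

In use, `x` would be the per-frame concatenation of encoded depth features and proprioception, and the hidden state returned by `step` would be fed to the policy head in place of an explicit object-state estimate.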

If this is right

A single policy can execute multiple dynamic tasks without per-task retraining or reference trajectories.
Control remains functional when full environmental state is unavailable, relying instead on onboard depth and proprioception.
Decoupled velocity commands plus an interaction flag suffice to coordinate whole-body motion for carrying, pushing, and balancing tasks.
Real-world transfer succeeds on a physical humanoid without additional sensing hardware beyond depth and joint encoders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation structure might allow reuse of the teacher across different robot morphologies if the student input interface stays fixed.
Replacing explicit dynamics models with recurrent vision inference could reduce the engineering cost of adding new object classes.
Extending the command interface to include higher-level goals such as target locations would test whether the current velocity-plus-indicator format remains sufficient.

Load-bearing premise

The recurrent object adaptation module can implicitly infer unobservable object dynamics from raw depth streams and proprioception.

What would settle it

Deploy the student policy on objects whose mass or surface friction differs from training examples and measure whether interaction stability collapses when the module receives no explicit dynamics parameters.
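One way to organize that test is a grid sweep over object parameters. The sketch below only shows the bookkeeping: `rollout` is a hypothetical callable that the experimenter would supply (one simulated or real trial returning success or failure), and the mass and friction scale factors are placeholders, not values from the paper.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def sweep_object_parameters(
    rollout: Callable[[float, float], bool],
    mass_scales: Iterable[float],
    friction_scales: Iterable[float],
    trials: int = 10,
) -> Dict[Tuple[float, float], float]:
    """Return the success rate for every (mass scale, friction scale) pair.

    `rollout(mass_scale, friction_scale)` is assumed to run one trial with
    the object's mass and friction scaled relative to training and to
    return True on success.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    results = {}
    for mass, friction in product(list(mass_scales), list(friction_scales)):
        successes = sum(bool(rollout(mass, friction)) for _ in range(trials))
        results[(mass, friction)] = successes / trials
    return results
```

A sharp drop in success rate at grid points outside the training range would count as evidence against the premise; a flat profile would support it.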

Figures

Figures reproduced from arXiv: 2606.09286 by Diyun Xiang, Dongting Li, Guoyao Zhang, Jianzhu Ma, Liang Li, Mingliang Zhou, Qiang Zhang, Qianyang Wu, Renjing Xu, Sikai Wu, Xingyu Chen, Yuhang Lin.

**Figure 2.** Overview of VAIC. The framework follows a two-stage distillation paradigm. A privi…

**Figure 3.** Real-world hardware generalization of VAIC to out-of-distribution object attributes and…

**Figure 4.** Comparison of the predicted object state from the VAIC adaptation module against Mu…

Original abstract

Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

VAIC sketches a distillation pipeline with decoupled commands and recurrent adaptation for vision-only humanoid tasks, but the abstract supplies zero metrics or ablations to back the multi-task claims.

The main point is that this paper describes a teacher-student setup where the student runs on depth, proprioception history, and simple multi-axis velocity targets plus an interaction flag, with a recurrent module meant to handle object dynamics on the fly for tasks like box carrying, cart pushing, and skateboarding.

What stands out as new is the specific combination of decoupled velocity commands and the per-frame interaction indicator inside the distillation, aimed at removing the need for dense trajectories and full observability. The paper does a reasonable job identifying the deployment gap in current humanoid controllers and showing how the student policy is structured to close it.

The soft spot is the lack of any numbers. The abstract asserts real-robot outperformance and a single policy handling diverse contact regimes, yet reports no success rates, trial counts, baseline details, or failure modes. The recurrent adaptation is presented as the mechanism that infers unobservable properties like mass and friction from depth streams, but nothing in the text shows ablations, object variation tests, or evidence that this inference works across the three tasks. The stress-test concern lands directly here: depth gives geometry, not intrinsic parameters, so the proprioceptive history must carry the load, and that step remains unverified.

This is for researchers working on sim-to-real humanoid RL and vision-based interaction. A reader in that niche might pick up the command interface or adaptation module as something to test, but the current text does not give enough to build on.

Send it to peer review so the experiments can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper introduces VAIC, a two-stage distillation framework for humanoid agile object interaction. A privileged teacher policy learns diverse skills with full state access; the student policy operates on depth, proprioception history, and decoupled velocity targets plus an interaction indicator, using a recurrent object adaptation module to infer unobservable dynamics. The central claim is that a single student policy executes box carrying, cart interaction, and skateboarding on a real humanoid, consistently outperforming baselines.

Significance. If validated with quantitative evidence, the decoupled command interface and recurrent adaptation for implicit dynamics inference would represent a meaningful step toward generalizable humanoid controllers that reduce dependence on perfect observability and dense trajectories. The two-stage paradigm is a clear strength for sim-to-real transfer in multi-contact tasks.

major comments (2)

[Abstract and Evaluations section] The claim that a single policy is 'consistently outperforming baselines' on real-robot tasks is made without any reported metrics, baseline implementations, trial counts, success rates, or failure-mode analysis; this directly undermines assessment of the multi-task generalization result.
[Student policy description, recurrent object adaptation module] No ablation studies, object-parameter variation sweeps, or quantitative tests are provided to support the claim that the RNN implicitly infers unobservable dynamics (mass distribution, friction, compliance) from depth and proprioception history across qualitatively different contact regimes; this is load-bearing for the headline claim.

minor comments (1)

[Method] Notation for the decoupled velocity targets and interaction indicator could be made more explicit with an equation or diagram reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the planned revisions.

Referee: [Abstract and Evaluations section] The claim that a single policy is 'consistently outperforming baselines' on real-robot tasks is made without any reported metrics, baseline implementations, trial counts, success rates, or failure-mode analysis; this directly undermines assessment of the multi-task generalization result.

Authors: We agree that the abstract's claim requires quantitative backing to be fully assessable. The current evaluations section focuses on qualitative real-robot demonstrations. We will revise to include a dedicated quantitative table reporting success rates, number of trials, baseline implementations, and failure-mode analysis for each task. revision: yes

Referee: [Student policy description, recurrent object adaptation module] No ablation studies, object-parameter variation sweeps, or quantitative tests are provided to support the claim that the RNN implicitly infers unobservable dynamics (mass distribution, friction, compliance) from depth and proprioception history across qualitatively different contact regimes; this is load-bearing for the headline claim.

Authors: The referee is correct that the submitted manuscript lacks ablations or parameter sweeps for the recurrent adaptation module. We will add new experiments, including ablations with and without the RNN, plus quantitative tests varying object mass, friction, and compliance across contact regimes, to directly support the inference claim. revision: yes
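For the success-rate table promised in the first response, small real-robot trial counts make a bare percentage hard to interpret, so an interval estimate is useful. The sketch below computes a Wilson score interval for a binomial success rate; the function name is our own, and any counts passed to it in examples are hypothetical, not results from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper), both clipped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

With a hypothetical 8 successes in 10 trials the interval spans roughly 0.49 to 0.94, which shows how little ten trials can distinguish between a strong policy and a mediocre one.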

Circularity Check

0 steps flagged

No circularity: standard teacher-student distillation with no self-referential definitions or fitted predictions

The paper describes a conventional two-stage RL distillation pipeline (privileged teacher mastering skills with full state, student trained on depth/proprioception plus velocity targets) without any equations, parameters, or uniqueness claims that reduce the reported outcome to its own inputs by construction. The recurrent adaptation module is presented as a trained component whose inference capability is an empirical training result rather than a definitional identity. No self-citations are invoked as load-bearing mathematical facts, and no fitted quantities are relabeled as independent predictions. The derivation chain is therefore self-contained against external RL benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the approach implicitly rests on two standard assumptions: that policy distillation preserves performance, and that recurrent networks can extract dynamics from depth sequences.

[1] [1]

H. Wang, W. Zhang, R. Yu, T. Huang, J. Ren, F. Jia, Z. Wang, X. Niu, X. Chen, J. Chen, et al. Physhsi: Towards a real-world generalizable and natural humanoid-scene interaction system. arXiv preprint arXiv:2510.11072, 2025

arXiv 2025

[2] [2]

H. Weng, Y . Li, N. Sobanbabu, Z. Wang, Z. Luo, T. He, D. Ramanan, and G. Shi. Hdmi: Learning interactive humanoid whole-body control from human videos.arXiv preprint arXiv:2509.16757, 2025

arXiv 2025

[3] [3]

S. Zhao, Y . Ze, Y . Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. Resmimic: From gen- eral motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

arXiv 2025

[4] [4]

S. Yin, Y . Ze, H.-X. Yu, C. K. Liu, and J. Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

arXiv 2025

[5] [5]

D. Li, X. Chen, Q. Wu, B. Chen, S. Wu, H. Wu, G. Zhang, L. Li, M. Zhou, D. Xiang, J. Ma, Q. Zhang, and R. Xu. Haic: Humanoid agile object interaction control via dynamics-aware world model.arXiv preprint arXiv:2602.11758, 2026

Pith/arXiv arXiv 2026

[6] [6]

Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

Pith/arXiv arXiv 2025

[7] [7]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Casta ˜neda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025

[8] [8]

X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics (ToG), 40(4): 1–20, 2021

2021

[9] [9]

Y . Lin, J. Shi, D. Wang, J. Kong, Y . Liu, C. Bai, and X. Li. Pro-hoi: Perceptive root-guided humanoid-object interaction.arXiv preprint arXiv:2603.01126, 2026

arXiv 2026

[10] [10]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Pith/arXiv arXiv 2025

[11] [11]

Y . Wang, Q. Zhao, Y . F. Lau, R. Yu, H. W. Tsui, Q. Chen, J. Wang, J. Pang, and P. Tan. Humanx: Toward agile and generalizable humanoid interaction skills from human videos.arXiv preprint arXiv:2602.02473, 2026

arXiv 2026

[12] [12]

X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne. Deepmimic: Example-guided deep re- inforcement learning of physics-based character skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

2018

[13] [13]

Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025

[14] [14]

K. Yin, W. Zeng, K. Fan, M. Dai, Z. Wang, Q. Zhang, Z. Tian, J. Wang, J. Pang, and W. Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

arXiv 2025

[15] [15]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 9

arXiv 2025

[16] [16]

T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, et al. Hover: Versatile neural whole-body controller for humanoid robots. InInternational Conference on Robotics and Automation (ICRA), 2025

2025

[17] [17]

T. He, J. Gao, W. Xiao, Y . Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbabu, C. Pan, Z. Yi, G. Qu, K. Kitani, J. Hodgins, L. J. Fan, Y . Zhu, C. Liu, and G. Shi. Asap: Aligning sim- ulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

arXiv 2025

[18] [18]

Rempe, M

D. Rempe, M. Petrovich, Y . Yuan, H. Zhang, X. B. Peng, Y . Jiang, T. Wang, U. Iqbal, D. Minor, M. de Ruyter, et al. Kimodo: Scaling controllable human motion generation.arXiv preprint arXiv:2603.15546, 2026

arXiv 2026

[19] [19]

W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

arXiv 2025

[20] M. Yuan, T. Yu, W. Ge, X. Yao, D. Li, H. Wang, J. Chen, X. Jin, B. Li, H. Chen, et al. Behavior foundation model: Towards next-generation whole-body control system of humanoid robots. arXiv preprint arXiv:2506.20487, 2025.

[21] Y. Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. arXiv preprint arXiv:2511.04131, 2025.

[22] N. Jiang, Z. He, W. Yu, L. Pang, Y. Li, H. Li, J. Cui, Y. Li, Y. Wang, Y. Zhu, et al. Uniact: Unified motion generation and action streaming for humanoid robots. arXiv preprint arXiv:2512.24321, 2025.

[23] Y. Shao, X. Huang, B. Zhang, Q. Liao, Y. Gao, Y. Chi, Z. Li, S. Shao, and K. Sreenath. Langwbc: Language-directed humanoid whole-body control via end-to-end learning. arXiv preprint arXiv:2504.21738, 2025.

[24] B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt, and G. Pons-Moll. BEHAVE: Dataset and method for tracking human object interactions. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2022.

[25] J. Zhang, H. Luo, H. Yang, X. Xu, Q. Wu, Y. Shi, J. Yu, L. Xu, and J. Wang. Neuraldome: A neural modeling pipeline on multi-view human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8834–8845, 2023.

[26] J. Lu, C.-H. P. Huang, U. Bhattacharya, Q. Huang, and Y. Zhou. Humoto: A 4d dataset of mocap human object interactions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10886–10897, 2025.

[27] C. Zhao, J. Zhang, J. Du, Z. Shan, J. Wang, J. Yu, J. Wang, and L. Xu. I’m hoi: Inertia-aware monocular capture of 3d human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 729–741, 2024.

[28] S. Xu, D. Li, Y. Zhang, X. Xu, Q. Long, Z. Wang, Y. Lu, S. Dong, H. Jiang, A. Gupta, Y.-X. Wang, and L.-Y. Gui. Interact: Advancing large-scale versatile 3d human-object interaction generation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.

[29] Y. Wang, J. Lin, A. Zeng, Z. Luo, J. Zhang, and L. Zhang. Physhoi: Physics-based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.

[30] S. Xu, H. Y. Ling, Y.-X. Wang, and L. Gui. Intermimic: Towards universal whole-body control for physics-based human-object interactions. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.

[31] Y. Lin, Y. Xie, J. Xie, Y. Huang, R. Wang, J. Lv, Y. Ma, and X. Zuo. Simgenhoi: Physically realistic whole-body humanoid-object interaction via generative modeling and reinforcement learning. arXiv preprint arXiv:2508.14120, 2025.

[32] Q. Wu, Y. Shi, X. Huang, J. Yu, L. Xu, and J. Wang. Thor: Text to human-object interaction diffusion via relation intervention. arXiv preprint arXiv:2403.11208, 2024.

[33] L. Pan, Z. Yang, Z. Dou, W. Wang, B. Huang, B. Dai, T. Komura, and J. Wang. Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.

[34] J. Dao, H. Duan, and A. Fern. Sim-to-real learning for humanoid box loco-manipulation. In International Conference on Robotics and Automation (ICRA), 2024.

[35] Y. Zhang, Y. Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive humanoid loco-manipulation. arXiv preprint arXiv:2505.06776, 2025.

[36] W. Sun, L. Feng, B. Cao, Y. Liu, Y. Jin, and Z. Xie. Ulc: A unified and fine-grained controller for humanoid loco-manipulation. arXiv preprint arXiv:2507.06905, 2025.

[37] L. Wei, X. Peng, R.-Z. Qiu, T. Huang, X. Cheng, and X. Wang. Hmc: Learning heterogeneous meta-control for contact-rich loco-manipulation. arXiv preprint arXiv:2511.14756, 2025.

[38] Z. Zhang, H. Lu, Y. Lian, Z. Chen, Y. Liu, C. Lin, H. Xue, Z. Zeng, Z. Qi, S. Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data. arXiv preprint arXiv:2603.12686, 2026.

[39] Y. Chen, S. Dong, X. Ji, J. Sun, Z. Luo, L. Zhao, J. Zhang, W. Li, J. Ma, B. Xu, et al. Learning human-like badminton skills for humanoid robots. arXiv preprint arXiv:2602.08370, 2026.

[40] J. Kong, X. Liu, Y. Lin, J. Han, S. Schwertfeger, C. Bai, and X. Li. Learning soccer skills for humanoid robots: A progressive perception-action framework. arXiv preprint arXiv:2602.05310, 2026.

[41] J. Ren, Y. Li, K. Zhang, P. Fu, H. Jiang, Y. Pan, G. Zeng, T. Huang, W. Guo, P. Lu, et al. Smash: Mastering scalable whole-body skills for humanoid ping-pong with egocentric vision. arXiv preprint arXiv:2604.01158, 2026.

[42] A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. In Proceedings of the Conference on Robot Learning (CoRL), 2025.

[43] T. Wu, X. Kong, Y. Chen, Q. Yu, H. Ye, J. Li, Y. Wang, and H. Dong. Sugar: A scalable human-video-driven generalizable humanoid loco-manipulation learning framework. arXiv preprint arXiv:2605.20373, 2026.

[44] A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on Robot Learning (CoRL), 2022.

[45] J. Long, J. Ren, M. Shi, Z. Wang, T. Huang, P. Luo, and J. Pang. Learning humanoid locomotion with perceptive internal model. arXiv preprint arXiv:2411.14386, 2024.

[46] J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y. Guo, and Q. Zhang. Dpl: Depth-only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction. arXiv preprint arXiv:2510.07152, 2025.

[47] C. Han, S. He, Y. Cheng, L. Ye, and H. Liu. Prior: Perceptive learning for humanoid locomotion with reference gait priors. arXiv preprint arXiv:2603.18979, 2026.

[48] Y. Zhang, Y. Seo, J. Chen, Y. Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains. arXiv preprint arXiv:2602.03002, 2026.

[49] W. Sun, Y. Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, B. Cao, Y. Liu, et al. Now you see that: Learning end-to-end humanoid locomotion from raw pixels. arXiv preprint arXiv:2602.06382, 2026.

[50] Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. In Conference on Robot Learning (CoRL), 2024.

[51] Z. Wu, X. Huang, L. Yang, Y. Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, et al. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026.

[52] Z. Zhuang, S. Zhu, M. Zhao, and H. Zhao. Deep whole-body parkour. arXiv preprint arXiv:2601.07701, 2026.

[53] S. Zhu, B. Ye, J. Wang, J. Chen, Z. Zhuang, L. Mou, R. Huang, and H. Zhao. Ttt-parkour: Rapid test-time training for perceptive robot parkour. arXiv preprint arXiv:2602.02331, 2026.

[54] H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Darrell, K. Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction. arXiv preprint arXiv:2506.13751, 2025.

[55] T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y. Yuan, X. Da, F. Castañeda, S. Sastry, et al. Viral: Visual sim-to-real at scale for humanoid loco-manipulation. arXiv preprint arXiv:2511.15200, 2025.

[56] H. Jiang, J. Chen, Q. Bu, L. Chen, M. Shi, Y. Zhang, D. Li, C. Suo, C. Wang, Z. Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control. arXiv preprint arXiv:2512.11047, 2025.

[57] H. Liu, Y. Gao, S. Teng, Y. Chi, Y. S. Shao, Z. Li, M. Ghaffari, and K. Sreenath. Ego-vision world model for humanoid contact planning. In International Conference on Robotics and Automation (ICRA), 2026.

[58] Y. Lin, J. Cui, Y. Li, B. Jia, Y. Zhu, and S. Huang. Lessmimic: Long-horizon humanoid interaction with unified distance field representations. arXiv preprint arXiv:2602.21723, 2026.

[59] X. He, S. Xu, X. Li, R. Dong, L. Bian, Y.-X. Wang, and L.-Y. Gui. Ultra: Unified multimodal control for autonomous humanoid whole-body loco-manipulation. arXiv preprint arXiv:2603.03279, 2026.

[60] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Appendix

In this appendix, we provide additional experimental setups and details:

1. Demo Video. A demonstration video includin...