pith. sign in

arxiv: 2606.09286 · v1 · pith:RSO4O4QHnew · submitted 2026-06-08 · 💻 cs.RO

VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

Pith reviewed 2026-06-27 16:36 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid robotagile object interactionpolicy distillationvision-guided controlrecurrent adaptationdecoupled commandsreal-world deployment
0
0 comments X

The pith

VAIC distills a teacher policy into a single student policy that performs diverse agile object interactions on humanoids using only depth, proprioception, and decoupled commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the gap between simulation-trained controllers and real-world humanoid deployment by removing the need for dense reference trajectories and perfect state observability. It introduces a two-stage process in which a privileged teacher first masters interaction skills with full kinematic information, then transfers the behavior to a deployable student that receives only onboard depth images, past joint states, velocity targets on multiple axes, and a per-frame interaction flag. A recurrent module inside the student learns to recover hidden object properties on the fly. If this works, one policy can be deployed across unstructured settings for tasks that previously required separate controllers or privileged sensing. The result is framed as progress toward autonomous humanoids that assist in everyday environments.

Core claim

VAIC is a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. It employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world

What carries the argument

The recurrent object adaptation module, which replaces explicit state estimation by learning to recover hidden object dynamics directly from depth images and proprioceptive history inside the student policy.

If this is right

  • A single policy can execute multiple dynamic tasks without per-task retraining or reference trajectories.
  • Control remains functional when full environmental state is unavailable, relying instead on onboard depth and proprioception.
  • Decoupled velocity commands plus an interaction flag suffice to coordinate whole-body motion for carrying, pushing, and balancing tasks.
  • Real-world transfer succeeds on a physical humanoid without additional sensing hardware beyond depth and joint encoders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation structure might allow reuse of the teacher across different robot morphologies if the student input interface stays fixed.
  • Replacing explicit dynamics models with recurrent vision inference could reduce the engineering cost of adding new object classes.
  • Extending the command interface to include higher-level goals such as target locations would test whether the current velocity-plus-indicator format remains sufficient.

Load-bearing premise

The recurrent object adaptation module can implicitly infer unobservable object dynamics from raw depth streams and proprioception.

What would settle it

Deploy the student policy on objects whose mass or surface friction differs from training examples and measure whether interaction stability collapses when the module receives no explicit dynamics parameters.

Figures

Figures reproduced from arXiv: 2606.09286 by Diyun Xiang, Dongting Li, Guoyao Zhang, Jianzhu Ma, Liang Li, Mingliang Zhou, Qiang Zhang, Qianyang Wu, Renjing Xu, Sikai Wu, Xingyu Chen, Yuhang Lin.

Figure 1
Figure 1. Figure 1: We propose VAIC, a unified vision-guided control framework that enables a humanoid [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of VAIC. The framework follows a two-stage distillation paradigm. A privi [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Real-world hardware generalization of VAIC to out-of-distribution object attributes and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of the predicted object state from the VAIC adaptation module against Mu [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VAIC, a two-stage distillation framework for humanoid agile object interaction. A privileged teacher policy learns diverse skills with full state access; the student policy operates on depth, proprioception history, and decoupled velocity targets plus an interaction indicator, using a recurrent object adaptation module to infer unobservable dynamics. The central claim is that a single student policy executes box carrying, cart interaction, and skateboarding on a real humanoid, consistently outperforming baselines.

Significance. If validated with quantitative evidence, the decoupled command interface and recurrent adaptation for implicit dynamics inference would represent a meaningful step toward generalizable humanoid controllers that reduce dependence on perfect observability and dense trajectories. The two-stage paradigm is a clear strength for sim-to-real transfer in multi-contact tasks.

major comments (2)
  1. [Abstract and Evaluations section] Abstract and Evaluations section: the claim that a single policy 'consistently outperforming baselines' on real-robot tasks is made without any reported metrics, baseline implementations, trial counts, success rates, or failure-mode analysis; this directly undermines assessment of the multi-task generalization result.
  2. [Student policy description] Student policy description (recurrent object adaptation module): no ablation studies, object-parameter variation sweeps, or quantitative tests are provided to support the claim that the RNN implicitly infers unobservable dynamics (mass distribution, friction, compliance) from depth and proprioception history across qualitatively different contact regimes; this is load-bearing for the headline claim.
minor comments (1)
  1. [Method] Notation for the decoupled velocity targets and interaction indicator could be made more explicit with an equation or diagram reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the planned revisions.

read point-by-point responses
  1. Referee: [Abstract and Evaluations section] Abstract and Evaluations section: the claim that a single policy 'consistently outperforming baselines' on real-robot tasks is made without any reported metrics, baseline implementations, trial counts, success rates, or failure-mode analysis; this directly undermines assessment of the multi-task generalization result.

    Authors: We agree that the abstract's claim requires quantitative backing to be fully assessable. The current evaluations section focuses on qualitative real-robot demonstrations. We will revise to include a dedicated quantitative table reporting success rates, number of trials, baseline implementations, and failure-mode analysis for each task. revision: yes

  2. Referee: [Student policy description] Student policy description (recurrent object adaptation module): no ablation studies, object-parameter variation sweeps, or quantitative tests are provided to support the claim that the RNN implicitly infers unobservable dynamics (mass distribution, friction, compliance) from depth and proprioception history across qualitatively different contact regimes; this is load-bearing for the headline claim.

    Authors: The referee is correct that the submitted manuscript lacks ablations or parameter sweeps for the recurrent adaptation module. We will add new experiments, including ablations with and without the RNN, plus quantitative tests varying object mass, friction, and compliance across contact regimes, to directly support the inference claim. revision: yes

Circularity Check

0 steps flagged

No circularity: standard teacher-student distillation with no self-referential definitions or fitted predictions

full rationale

The paper describes a conventional two-stage RL distillation pipeline (privileged teacher mastering skills with full state, student trained on depth/proprioception plus velocity targets) without any equations, parameters, or uniqueness claims that reduce the reported outcome to its own inputs by construction. The recurrent adaptation module is presented as a trained component whose inference capability is an empirical training result rather than a definitional identity. No self-citations are invoked as load-bearing mathematical facts, and no fitted quantities are relabeled as independent predictions. The derivation chain is therefore self-contained against external RL benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the approach implicitly rests on standard assumptions of policy distillation preserving performance and recurrent networks being able to extract dynamics from depth sequences.

pith-pipeline@v0.9.1-grok · 5766 in / 1098 out tokens · 25118 ms · 2026-06-27T16:36:39.946289+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 8 linked inside Pith

  1. [1]

    H. Wang, W. Zhang, R. Yu, T. Huang, J. Ren, F. Jia, Z. Wang, X. Niu, X. Chen, J. Chen, et al. Physhsi: Towards a real-world generalizable and natural humanoid-scene interaction system. arXiv preprint arXiv:2510.11072, 2025

  2. [2]

    H. Weng, Y . Li, N. Sobanbabu, Z. Wang, Z. Luo, T. He, D. Ramanan, and G. Shi. Hdmi: Learning interactive humanoid whole-body control from human videos.arXiv preprint arXiv:2509.16757, 2025

  3. [3]

    S. Zhao, Y . Ze, Y . Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. Resmimic: From gen- eral motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

  4. [4]

    S. Yin, Y . Ze, H.-X. Yu, C. K. Liu, and J. Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

  5. [5]

    D. Li, X. Chen, Q. Wu, B. Chen, S. Wu, H. Wu, G. Zhang, L. Li, M. Zhou, D. Xiang, J. Ma, Q. Zhang, and R. Xu. Haic: Humanoid agile object interaction control via dynamics-aware world model.arXiv preprint arXiv:2602.11758, 2026

  6. [6]

    Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

  7. [7]

    Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Casta ˜neda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

  8. [8]

    X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics (ToG), 40(4): 1–20, 2021

  9. [9]

    Y . Lin, J. Shi, D. Wang, J. Kong, Y . Liu, C. Bai, and X. Li. Pro-hoi: Perceptive root-guided humanoid-object interaction.arXiv preprint arXiv:2603.01126, 2026

  10. [10]

    L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

  11. [11]

    Y . Wang, Q. Zhao, Y . F. Lau, R. Yu, H. W. Tsui, Q. Chen, J. Wang, J. Pang, and P. Tan. Humanx: Toward agile and generalizable humanoid interaction skills from human videos.arXiv preprint arXiv:2602.02473, 2026

  12. [12]

    X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne. Deepmimic: Example-guided deep re- inforcement learning of physics-based character skills.ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

  13. [13]

    Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

  14. [14]

    K. Yin, W. Zeng, K. Fan, M. Dai, Z. Wang, Q. Zhang, Z. Tian, J. Wang, J. Pang, and W. Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025

  15. [15]

    J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 9

  16. [16]

    T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, et al. Hover: Versatile neural whole-body controller for humanoid robots. InInternational Conference on Robotics and Automation (ICRA), 2025

  17. [17]

    T. He, J. Gao, W. Xiao, Y . Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbabu, C. Pan, Z. Yi, G. Qu, K. Kitani, J. Hodgins, L. J. Fan, Y . Zhu, C. Liu, and G. Shi. Asap: Aligning sim- ulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

  18. [18]

    Rempe, M

    D. Rempe, M. Petrovich, Y . Yuan, H. Zhang, X. B. Peng, Y . Jiang, T. Wang, U. Iqbal, D. Minor, M. de Ruyter, et al. Kimodo: Scaling controllable human motion generation.arXiv preprint arXiv:2603.15546, 2026

  19. [19]

    W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

  20. [20]

    M. Yuan, T. Yu, W. Ge, X. Yao, D. Li, H. Wang, J. Chen, X. Jin, B. Li, H. Chen, et al. Behavior foundation model: Towards next-generation whole-body control system of humanoid robots. arXiv preprint arXiv:2506.20487, 2025

  21. [21]

    Y . Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning.arXiv preprint arXiv:2511.04131, 2025

  22. [22]

    Jiang, Z

    N. Jiang, Z. He, W. Yu, L. Pang, Y . Li, H. Li, J. Cui, Y . Li, Y . Wang, Y . Zhu, et al. Uni- act: Unified motion generation and action streaming for humanoid robots.arXiv preprint arXiv:2512.24321, 2025

  23. [23]

    Y . Shao, X. Huang, B. Zhang, Q. Liao, Y . Gao, Y . Chi, Z. Li, S. Shao, and K. Sreenath. Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025

  24. [24]

    B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt, and G. Pons-Moll. BEHA VE: Dataset and method for tracking human object interactions. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2022

  25. [25]

    Zhang, H

    J. Zhang, H. Luo, H. Yang, X. Xu, Q. Wu, Y . Shi, J. Yu, L. Xu, and J. Wang. Neuraldome: A neural modeling pipeline on multi-view human-object interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8834–8845, 2023

  26. [26]

    Lu, C.-H

    J. Lu, C.-H. P. Huang, U. Bhattacharya, Q. Huang, and Y . Zhou. Humoto: A 4d dataset of mocap human object interactions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10886–10897, 2025

  27. [27]

    C. Zhao, J. Zhang, J. Du, Z. Shan, J. Wang, J. Yu, J. Wang, and L. Xu. I’m hoi: Inertia- aware monocular capture of 3d human-object interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 729–741, 2024

  28. [28]

    S. Xu, D. Li, Y . Zhang, X. Xu, Q. Long, Z. Wang, Y . Lu, S. Dong, H. Jiang, A. Gupta, Y .-X. Wang, and L.-Y . Gui. Interact: Advancing large-scale versatile 3d human-object interac- tion generation. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025

  29. [29]

    Y . Wang, J. Lin, A. Zeng, Z. Luo, J. Zhang, and L. Zhang. Physhoi: Physics-based imitation of dynamic human-object interaction.arXiv preprint arXiv:2312.04393, 2023

  30. [30]

    S. Xu, H. Y . Ling, Y .-X. Wang, and L. Gui. Intermimic: Towards universal whole-body con- trol for physics-based human-object interactions. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025. 10

  31. [31]

    Y . Lin, Y . Xie, J. Xie, Y . Huang, R. Wang, J. Lv, Y . Ma, and X. Zuo. Simgenhoi: Physically realistic whole-body humanoid-object interaction via generative modeling and reinforcement learning.arXiv preprint arXiv:2508.14120, 2025

  32. [32]

    Q. Wu, Y . Shi, X. Huang, J. Yu, L. Xu, and J. Wang. Thor: Text to human-object interaction diffusion via relation intervention.arXiv preprint arXiv:2403.11208, 2024

  33. [33]

    L. Pan, Z. Yang, Z. Dou, W. Wang, B. Huang, B. Dai, T. Komura, and J. Wang. Tokenhsi: Uni- fied synthesis of physical human-scene interactions through task tokenization. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025

  34. [34]

    J. Dao, H. Duan, and A. Fern. Sim-to-real learning for humanoid box loco-manipulation. In International Conference on Robotics and Automation (ICRA), 2024

  35. [35]

    Zhang, Y

    Y . Zhang, Y . Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive hu- manoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

  36. [36]

    W. Sun, L. Feng, B. Cao, Y . Liu, Y . Jin, and Z. Xie. Ulc: A unified and fine-grained controller for humanoid loco-manipulation.arXiv preprint arXiv:2507.06905, 2025

  37. [37]

    L. Wei, X. Peng, R.-Z. Qiu, T. Huang, X. Cheng, and X. Wang. Hmc: Learning heterogeneous meta-control for contact-rich loco-manipulation.arXiv preprint arXiv:2511.14756, 2025

  38. [38]

    Zhang, H

    Z. Zhang, H. Lu, Y . Lian, Z. Chen, Y . Liu, C. Lin, H. Xue, Z. Zeng, Z. Qi, S. Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data.arXiv preprint arXiv:2603.12686, 2026

  39. [39]

    Y . Chen, S. Dong, X. Ji, J. Sun, Z. Luo, L. Zhao, J. Zhang, W. Li, J. Ma, B. Xu, et al. Learning human-like badminton skills for humanoid robots.arXiv preprint arXiv:2602.08370, 2026

  40. [40]

    J. Kong, X. Liu, Y . Lin, J. Han, S. Schwertfeger, C. Bai, and X. Li. Learning soccer skills for humanoid robots: A progressive perception-action framework.arXiv preprint arXiv:2602.05310, 2026

  41. [41]

    J. Ren, Y . Li, K. Zhang, P. Fu, H. Jiang, Y . Pan, G. Zeng, T. Huang, W. Guo, P. Lu, et al. Smash: Mastering scalable whole-body skills for humanoid ping-pong with egocentric vision. arXiv preprint arXiv:2604.01158, 2026

  42. [42]

    Allshire, H

    A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. InPro- ceedings of the Conference on Robot Learning (CoRL), 2025

  43. [43]

    T. Wu, X. Kong, Y . Chen, Q. Yu, H. Ye, J. Li, Y . Wang, and H. Dong. Sugar: A scalable human- video-driven generalizable humanoid loco-manipulation learning framework.arXiv preprint arXiv:2605.20373, 2026

  44. [44]

    Agarwal, A

    A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. InConference on Robot Learning (CoRL), 2022

  45. [45]

    J. Long, J. Ren, M. Shi, Z. Wang, T. Huang, P. Luo, and J. Pang. Learning humanoid locomo- tion with perceptive internal model.arXiv preprint arXiv:2411.14386, 2024

  46. [46]

    J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y . Guo, and Q. Zhang. Dpl: Depth- only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction.arXiv preprint arXiv:2510.07152, 2025

  47. [47]

    C. Han, S. He, Y . Cheng, L. Ye, and H. Liu. Prior: Perceptive learning for humanoid locomo- tion with reference gait priors.arXiv preprint arXiv:2603.18979, 2026. 11

  48. [48]

    Zhang, Y

    Y . Zhang, Y . Seo, J. Chen, Y . Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains.arXiv preprint arXiv:2602.03002, 2026

  49. [49]

    W. Sun, Y . Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, B. Cao, Y . Liu, et al. Now you see that: Learning end-to-end humanoid locomotion from raw pixels.arXiv preprint arXiv:2602.06382, 2026

  50. [50]

    Zhuang, S

    Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. InConference on Robot Learn- ing (CoRL), 2024

  51. [51]

    Z. Wu, X. Huang, L. Yang, Y . Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, et al. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching.arXiv preprint arXiv:2602.15827, 2026

  52. [52]

    Zhuang, S

    Z. Zhuang, S. Zhu, M. Zhao, and H. Zhao. Deep whole-body parkour.arXiv preprint arXiv:2601.07701, 2026

  53. [53]

    S. Zhu, B. Ye, J. Wang, J. Chen, Z. Zhuang, L. Mou, R. Huang, and H. Zhao. Ttt-parkour: Rapid test-time training for perceptive robot parkour.arXiv preprint arXiv:2602.02331, 2026

  54. [54]

    H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Dar- rell, K. Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025

  55. [55]

    T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y . Yuan, X. Da, F. Casta ˜neda, S. Sas- try, et al. Viral: Visual sim-to-real at scale for humanoid loco-manipulation.arXiv preprint arXiv:2511.15200, 2025

  56. [56]

    Jiang, J

    H. Jiang, J. Chen, Q. Bu, L. Chen, M. Shi, Y . Zhang, D. Li, C. Suo, C. Wang, Z. Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control.arXiv preprint arXiv:2512.11047, 2025

  57. [57]

    H. Liu, Y . Gao, S. Teng, Y . Chi, Y . S. Shao, Z. Li, M. Ghaffari, and K. Sreenath. Ego-vision world model for humanoid contact planning. InInternational Conference on Robotics and Automation (ICRA), 2026

  58. [58]

    Y . Lin, J. Cui, Y . Li, B. Jia, Y . Zhu, and S. Huang. Lessmimic: Long-horizon humanoid interaction with unified distance field representations.arXiv preprint arXiv:2602.21723, 2026

  59. [59]

    X. He, S. Xu, X. Li, R. Dong, L. Bian, Y .-X. Wang, and L.-Y . Gui. Ultra: Unified mul- timodal control for autonomous humanoid whole-body loco-manipulation.arXiv preprint arXiv:2603.03279, 2026

  60. [60]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 12 V AIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands Appendix In this appendix, we provide additional experimental setups and details: 1.Demo Video.A demonstration video includin...