
arxiv: 2606.20990 · v1 · pith:NALKU4IAnew · submitted 2026-06-18 · 💻 cs.RO

Duet: Dual-Robot Understanding via Efficient Teaching

Pith reviewed 2026-06-26 16:44 UTC · model grok-4.3

classification 💻 cs.RO
keywords dual-robot collaboration, VR teleoperation, human priors, action chunking transformer, mobile manipulation, collaborative tasks, data efficiency, heterogeneous robots

The pith

Pretraining dual-robot policies on tracked human-human coordination demonstrations matches or exceeds robot-only training on task performance, and those human demonstrations are collected 5.4 times faster on average than robot teleoperation data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DUET, a framework with two data sources: a tracking pipeline that records human-human coordination demonstrations to capture collaborative priors, and a synchronized VR teleoperation system that collects in-domain data from two heterogeneous robots. The human priors pretrain an Action Chunking Transformer (ACT) policy, which is then fine-tuned on a small set of real-robot teleoperation trajectories from a humanoid and a mobile manipulator. The approach targets the data bottleneck in training robots for tasks such as joint object transport and handovers that exceed single-robot capabilities. If the transfer works, human priors can replace much of the expensive robot-specific data collection while matching, and sometimes improving, final performance. The paper reports that on a four-task benchmark the method matches or exceeds baselines trained only on robot data.

Core claim

DUET pairs a unified VR-based teleoperation system for the two robots with a complementary tracking pipeline that records human-human coordination and collaborative mobile manipulation priors. It pretrains collaborative policies with an Action Chunking Transformer on the human demonstrations, then fine-tunes them on a minimal set of real-robot trajectories. On a benchmark of four collaborative tasks with a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator, policies trained this way match or outperform robot-only baselines, and the human data collection pipeline gathers demonstrations 5.4 times faster on average than direct teleoperation.
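To relate the per-demonstration speedup to total collection effort, here is a back-of-envelope sketch in Python. It combines two numbers reported elsewhere on this page: the 5.4x average speedup and the data budgets in the Figure 7 caption (60 human demos plus 30 robot trajectories for DUET, 50 robot trajectories for the larger robot-only baseline). The function name `teleop_equivalent_effort` and the assumption that the average speedup applies uniformly to every demonstration are illustrative choices, not taken from the paper.

```python
def teleop_equivalent_effort(n_human: int, n_robot: int, speedup: float) -> float:
    """Collection effort in units of one teleoperated robot trajectory.

    Assumes (hypothetically) that one human demonstration costs 1/speedup
    of a teleoperated trajectory, uniformly across tasks.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return n_human / speedup + n_robot


# Budgets from the Figure 7 caption; 5.4 is the reported average speedup.
duet = teleop_equivalent_effort(n_human=60, n_robot=30, speedup=5.4)
robot_only = teleop_equivalent_effort(n_human=0, n_robot=50, speedup=5.4)
saving = 1.0 - duet / robot_only

print(f"DUET: {duet:.1f} units, robot-only: {robot_only:.1f} units, saving: {saving:.1%}")
```

Under these assumptions DUET costs about 41.1 teleoperation-equivalent units against 50 for the 50-trajectory baseline, a saving of roughly 18%. Against the 30-trajectory baseline the same arithmetic gives DUET the higher collection cost, so the size of the effort saving depends on which baseline is taken as the reference.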

What carries the argument

An Action Chunking Transformer policy that is first pretrained on human-human demonstrations and then fine-tuned on a limited set of real-robot teleoperation trajectories.
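The Figure 4 caption states that the ACT policy uses a two-head design and that the joint head is randomly initialized at fine-tuning. A minimal, framework-free Python sketch of that hand-off between stages follows. The parameter names (`backbone.w`, `joint_head.w`), the Gaussian initializer, and its scale are assumptions made for illustration; the paper's actual layer names and initialization scheme are not given on this page.

```python
import random


def init_finetune_params(pretrained, reinit_prefix="joint_head.", std=0.02, seed=0):
    """Copy pretrained parameters, re-drawing those under `reinit_prefix`.

    `pretrained` maps parameter names to flat lists of floats. The names,
    the Gaussian initializer, and `std` are hypothetical choices.
    """
    rng = random.Random(seed)
    params = {}
    for name, values in pretrained.items():
        if name.startswith(reinit_prefix):
            # Fresh random weights: this head does not inherit from stage 1.
            params[name] = [rng.gauss(0.0, std) for _ in values]
        else:
            # Everything else carries over the stage-1 (human-data) weights.
            params[name] = list(values)
    return params


# Toy stage-1 checkpoint with made-up values.
stage1 = {
    "backbone.w": [0.5, -0.3, 0.8],
    "joint_head.w": [1.0, 1.0],
}
stage2_init = init_finetune_params(stage1)
print(stage2_init["backbone.w"])         # copied from stage 1
print(len(stage2_init["joint_head.w"]))  # same shape, newly drawn values
```

The design point the sketch isolates is that only the named head starts from scratch in stage 2, while the remaining parameters begin from the weights learned on human demonstrations.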

If this is right

  • Dual-robot policies for tasks exceeding single-robot reach can be trained with substantially less direct robot teleoperation time.
  • Human coordination priors captured by the tracking pipeline improve or maintain performance on coordinated transport and handover tasks after fine-tuning.
  • The synchronized VR collection system enables faster acquisition of in-domain data for heterogeneous robot pairs.
  • Fine-tuning after human pretraining sufficiently bridges embodiment differences between humans and the tested robot platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining-then-fine-tune pattern might reduce data needs for other multi-robot coordination problems beyond the dual case tested here.
  • If VR hardware is unavailable, alternative motion-capture methods could be substituted but would require separate validation of transfer quality.
  • Increasing the volume or diversity of human-human demonstrations during pretraining could further shrink the required robot fine-tuning set.
  • Real-world deployment would still need safety checks where human-derived coordination patterns meet physical robot constraints that human demonstrators do not face.

Load-bearing premise

Human-human coordination data recorded by the tracking pipeline transfers to heterogeneous robot embodiments with only minimal fine-tuning and little domain gap.

What would settle it

Train identical Action Chunking Transformer policies on the same four tasks using only the real-robot teleoperation data without the human pretraining stage and compare success rates and completion times to the full DUET pipeline; equal or higher performance in the robot-only case would falsify the value of the human priors.
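Because the Figure 7 caption reports 10 experiments per task, any such comparison rests on small samples. The Python sketch below computes a Wilson score interval for a success rate so that the overlap between two conditions can be inspected. The success counts are hypothetical placeholders, not results from the paper, and `wilson_interval` is a name chosen here.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts out of 10 trials; NOT taken from the paper.
pretrained_lo, pretrained_hi = wilson_interval(8, 10)
robot_only_lo, robot_only_hi = wilson_interval(6, 10)
overlap = pretrained_lo <= robot_only_hi and robot_only_lo <= pretrained_hi

print(f"pretrained: [{pretrained_lo:.2f}, {pretrained_hi:.2f}]")
print(f"robot-only: [{robot_only_lo:.2f}, {robot_only_hi:.2f}]")
print("intervals overlap:", overlap)
```

With 10 trials the intervals are wide, so a difference of one or two successes between conditions would not by itself separate them; more trials per task would be needed for the comparison to be decisive.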

Figures

Figures reproduced from arXiv: 2606.20990 by Basile Van Hoorick, Celina Shiyu Wang, Gaurav S. Sukhatme, Junjie Ye, Jyotirmoy V. Deshmukh, Leonidas Guibas, Minhao Li, Muchen Xu, Ruohai Ge, Sergey Zakharov, Vitor Campagnolo Guizilini, Yiqi Zhao, Yue Wang.

Figure 1
Figure 1: DUET. We introduce a dual-robot policy learning framework that features efficient learning from human demonstrations. Left: Two humans perform a collaborative manipulation task. Right: A heterogeneous dual-robot system mimics the human teachers.
Figure 2
Figure 2: Dual-Robot Teleoperation Pipeline. Two human operators utilize PICO VR interfaces to simultaneously control a heterogeneous robot duo, Vega1 via General Motion Retargeting [46] and G1 via the SONIC [20] framework. The system streams real-time visual feedback while asynchronously logging joint pose data.
Figure 3
Figure 3: DUET Overview. In stage 1, we pretrain an ACT architecture (…
Figure 4
Figure 4: Model Architecture. Our ACT policy leverages a 2-head design with the joint head randomly initialized during finetune. Our robots are controllable via joint-space commands. Our architecture (…
Figure 5
Figure 5: Benchmark data distribution and collection efficiency. Left: Total data span and number of episodes per pipeline across four tasks. Right: Amortized collection time (AT) for each task. The value above each task group is the speedup ratio, defined as the teleoperation AT divided by the human-demonstration AT, where AT is the average time for each successful data collection.
Figure 6
Figure 6: Overview of the Tasks. Snapshots show the initial setup and phases.
4.3 Robot-only Ablation Study. To address Q2, we conduct an ablation study where only Stage 2 from Section 3.3, training with 50 robot-only data for each task, is carried out on a fresh ACT (implemented through [50]) without human prior. To quantify multi-stage performance, we implement a normalized metric scoring each task trial from 0.0 t…
Figure 7
Figure 7: Evaluation Results. Comparison of task success rates and accumulated points over 10 experiments for each task of T1–T4. DUET (60 human demos + 30 robot trajectories) is compared against robot-only baselines (trained on 30 and 50 trajectories).
… demonstrating that our teleoperation pipeline yields a capable standalone baseline for multi-robot learning. 4.4 DUET Performance. To address Q3, we execute Stage 1 (…
Figure 8
Figure 8: Teleoperation System Example. A side-by-side of the human teleoperator and the synchronized dual-robot execution. The sequence demonstrates the G1 humanoid walking to the workspace to grasp the target object (a toy dog), followed by a coordinated physical handover to the Vega1 mobile manipulator, which subsequently pivots and places the object onto a secondary table.
B.2 Mesh Extraction. From each clip’s al…
Figure 9
Figure 9: Human Data Collection Example.
Figure 10
Figure 10: Zero-Shot Generalizability to OOD Conditions. We evaluate DUET under visual and physical distribution shifts. In T2, the standard boxes are replaced with a black foam box and a green-and-black box, altering visual cues and physical dynamics. In T3, the white board is entirely covered by a black cover to test resilience to drastic background variations. These changes are illustrated in the figure. Despite …
Original abstract

Dual-robot collaboration enables tasks that exceed the reach and payload of a single robot, such as collaboratively transporting objects across environments and executing coordinated handovers. Data acquisition is the primary bottleneck for training these systems. To this end, we introduce DUET, a dual-robot learning framework for mobile manipulation. For efficient data collection, we create a unified dual-embodiment synchronized VR-based teleoperation system for in-domain heterogeneous robot data collection. We further develop a complementary tracking pipeline that records human-human coordination and collaborative mobile manipulation priors. To allow efficient learning, we introduce an Action Chunking Transformer based architecture that first pretrains collaborative policies on efficient human-human demonstrations, before finetuning them on a minimal set of real-robot teleoperation trajectories. We develop a benchmark of four collaborative tasks to evaluate our framework using a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator. The results demonstrate that harnessing human priors not only yields superior task performance compared to baselines trained only on robot data, but also reduces the total human effort required for data collection. Our human data collection pipeline achieves 5.4x acceleration on average from teleoperation, but we perform equally or better than robot-only data trained policies across all tasks. Our project page is available at https://zhaoy37.github.io/Duet/.
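The Figure 5 caption defines amortized collection time (AT) as the average time per successful data collection, and the speedup ratio as the teleoperation AT divided by the human-demonstration AT. A small Python sketch of that bookkeeping follows. The session durations and success counts are made-up placeholders, and charging the time spent on failed attempts to the successful ones is an assumption about how "amortized" is meant.

```python
def amortized_time(total_seconds: float, successful_episodes: int) -> float:
    """Average wall-clock time per successful episode.

    Assumption: `total_seconds` includes failed attempts and resets, so
    their cost is spread over the successes.
    """
    if successful_episodes <= 0:
        raise ValueError("need at least one successful episode")
    return total_seconds / successful_episodes


def speedup_ratio(teleop_at: float, human_at: float) -> float:
    """Teleoperation AT divided by human-demonstration AT (Figure 5 definition)."""
    if human_at <= 0:
        raise ValueError("human_at must be positive")
    return teleop_at / human_at


# Placeholder numbers for one task; NOT measurements from the paper.
teleop_at = amortized_time(total_seconds=3600.0, successful_episodes=30)
human_at = amortized_time(total_seconds=1200.0, successful_episodes=60)
ratio = speedup_ratio(teleop_at, human_at)
print(f"teleop AT = {teleop_at:.0f} s, human AT = {human_at:.0f} s, speedup = {ratio:.1f}x")
```

The reported 5.4x is described as an average; whether it averages the per-task ratios or pools the collection times across tasks is not stated on this page, and the two conventions can give different values.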

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DUET, a dual-robot learning framework for mobile manipulation. It records human-human coordination priors with a tracking pipeline, collects robot data with a unified synchronized VR-based teleoperation system, pretrains an Action Chunking Transformer policy on the human demonstrations, and fine-tunes on a minimal set of real-robot trajectories. It evaluates the approach on four collaborative tasks using a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator, claiming equal or superior task performance over robot-only baselines together with a 5.4x average acceleration of human-demonstration collection relative to teleoperation.

Significance. If the transfer from human-human VR priors to heterogeneous robot embodiments holds with the reported performance gains, the framework would meaningfully reduce the data-collection bottleneck for dual-robot collaborative systems by substituting expensive robot teleoperation with faster human-human demonstrations.

major comments (1)
  1. [Abstract] The claim that human-human coordination data supplies priors that transfer to the Unitree G1 + Dexmate Vega1 pair with minimal domain gap, after fine-tuning on only a small robot-teleoperation set, is load-bearing for both the performance and the 5.4x effort-reduction assertions. Yet the manuscript provides no quantitative transfer metrics, embodiment-mapping details, or ablation isolating the pretraining contribution versus architecture or data volume.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that human-human coordination data supplies priors that transfer to the Unitree G1 + Dexmate Vega1 pair with minimal domain gap, after fine-tuning on only a small robot-teleoperation set, is load-bearing for both the performance and the 5.4x effort-reduction assertions. Yet the manuscript provides no quantitative transfer metrics, embodiment-mapping details, or ablation isolating the pretraining contribution versus architecture or data volume.

    Authors: We agree the abstract would benefit from greater specificity. The reported performance gains are measured against robot-only baselines trained on equivalent or larger volumes of real-robot data, and the 5.4x figure is a direct measurement of human collection time (human-human demonstration vs. robot teleoperation). To isolate the pretraining contribution we will add (i) an ablation comparing the same ACT architecture trained from scratch on the minimal robot set versus pretrained on human-human data then fine-tuned, (ii) quantitative transfer metrics such as success-rate deltas and sample-efficiency curves attributable to the human priors, and (iii) a concise description of the embodiment mapping realized by the unified VR teleoperation pipeline. These additions will be placed in the experiments section and referenced from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on benchmark comparisons

full rationale

The manuscript presents a robotics framework for dual-robot collaboration via pretraining on human-human demonstrations followed by robot fine-tuning, evaluated on four tasks with a Unitree G1 and a Dexmate Vega1. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance superiority and the 5.4x collection speedup are asserted via direct experimental comparison to robot-only baselines, not by construction from the inputs themselves. The central transfer assumption is an empirical claim open to falsification rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Framework depends on transferability of human coordination data to robots; no free parameters or invented entities visible in abstract.

axioms (1)
  • domain assumption: Human-human coordination priors transfer effectively to heterogeneous robot-robot collaboration after fine-tuning
    This is the core premise enabling the pretraining step and performance claims.

pith-pipeline@v0.9.1-grok · 5825 in / 1100 out tokens · 23597 ms · 2026-06-26T16:44:02.175157+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 9 canonical work pages

  1. [1]

    J. Fink, M. A. Hsieh, and V. Kumar. Multi-robot manipulation via caging in environments with obstacles. In 2008 IEEE International Conference on Robotics and Automation, pages 1471–1476. IEEE, 2008

  2. [2]

    M. Lai, K. Go, Z. Li, T. Kröger, S. Schaal, K. Allen, and J. Scholz. Roboballet: Planning for multirobot reaching with graph neural networks and reinforcement learning. Science Robotics, 10(106):eads1204, 2025

  3. [3]

    Z. Wang and M. Schwager. Multi-robot manipulation without communication. In Distributed autonomous robotic systems: The 12th international symposium, pages 135–149. Springer, 2016

  4. [4]

    Z. Fu, T. Z. Zhao, and C. Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=FO6tePGRZj

  5. [5]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=ZMnD6QZAE6

  6. [6]

    S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  7. [7]

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  8. [8]

    R. Hoque, P. Huang, D. J. Yoon, M. sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=FFxkFMU89E

  9. [9]

    R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Liconti, L. Y. Zhu, et al. Egoverse: An egocentric human dataset for robot learning from around the world. arXiv preprint arXiv:2604.07607, 2026

  10. [10]

    X. Yang, D. Kukreja, D. Pinkus, A. Sagar, T. Fan, J. Park, S. Shin, J. Cao, J. Liu, N. Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989, 2026

  11. [11]

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016

  12. [12]

    T. Z. Zhao, J. Tompson, D. Driess, P. Florence, S. K. S. Ghasemipour, C. Finn, and A. Wahid. ALOHA unleashed: A simple recipe for robot dexterity. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=gvdXE7ikHI

  13. [13]

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

  14. [14]

    S. Dass, W. Ai, Y. Jiang, S. Singh, J. Hu, R. Zhang, P. Stone, B. Abbatematteo, and R. Martín-Martín. Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869, 2024

  15. [15]

    Y. Jiang, R. Zhang, J. Wong, C. Wang, Y. Ze, H. Yin, C. Gokmen, S. Song, J. Wu, and L. Fei-Fei. BEHAVIOR robot suite: Streamlining real-world whole-body manipulation for everyday household activities. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=v2KevjWScT

  16. [16]

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.045

  17. [17]

    H. Choi, Y. Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song. In-the-wild compliant manipulation with umi-ft, 2026. URL https://arxiv.org/abs/2601.09988

  18. [18]

    H. Etukuru, N. Naka, Z. Hu, S. Lee, J. Mehu, A. Edsinger, C. Paxton, S. Chintala, L. Pinto, and N. M. Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8275–8283, 2025. doi:10.1109/ICRA55743.2025.11127857

  19. [19]

    S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al. Ψ0: An open foundation model towards universal humanoid loco-manipulation. arXiv preprint arXiv:2603.12263, 2026

  20. [20]

    Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025

  21. [21]

    J. Li, X. Cheng, T. Huang, S. Yang, R.-Z. Qiu, and X. Wang. AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.061

  22. [22]

    Y. Li, Y. Lin, J. Cui, T. Liu, W. Liang, Y. Zhu, and S. Huang. CLONE: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In 9th Annual Conference on Robot Learning,

  23. [23]

    URL https://openreview.net/forum?id=Bw9NHYjDqR

  24. [24]

    Y. Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832, 2025

  25. [25]

    C. Mattson, V. Raveendra, E. Novoseller, N. Waytowich, V. J. Lawhern, and D. S. Brown. R2bc: Multi-agent imitation learning from single-agent demonstrations. arXiv preprint arXiv:2510.18085, 2025

  26. [26]

    K. Song, S. Ma, G. Chen, N. Jin, G. Zhao, M. Ding, Z. Xiong, and J. Pan. Collabot: Vision-language guided simultaneous collaborative manipulation. arXiv preprint arXiv:2508.03526, 2025

  27. [27]

    W. Zhang, C. Street, and M. Mansouri. Multi-nonholonomic robot object transportation with obstacle crossing using a deformable sheet. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7349–7355, 2025. doi:10.1109/ICRA55743.2025.11128313

  28. [28]

    N. Michael, J. Fink, and V. Kumar. Cooperative manipulation and transportation with aerial robots. In Proceedings of Robotics: Science and Systems, Seattle, USA, June 2009. doi:10.15607/RSS.2009.V.001

  29. [29]

    C. Yang, G. N. Sue, Z. Li, L. Yang, H. Shen, Y. Chi, A. Rai, J. Zeng, and K. Sreenath. Collaborative navigation and manipulation of a cable-towed load by multiple quadrupedal robots. IEEE Robotics and Automation Letters, 7(4):10041–10048, 2022

  30. [30]

    R. T. Fawcett, L. Amanzadeh, J. Kim, A. D. Ames, and K. A. Hamed. Distributed data-driven predictive control for multi-agent collaborative legged locomotion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9924–9930, 2023. doi:10.1109/ICRA48891.2023.10160914

  31. [31]

    Y. Wang, M. Damani, P. Wang, Y. Cao, and G. Sartoretti. Distributed reinforcement learning for robot teams: A review. Current Robotics Reports, 3(4):239–257, 2022

  32. [32]

    B. Wu and C. S. Suh. State-of-the-art in robot learning for multi-robot collaboration: A comprehensive survey. arXiv preprint arXiv:2408.11822, 2024

  33. [33]

    B. Pandit, A. K. Shrestha, and A. Fern. Multi-quadruped cooperative object transport: Learning decentralized pinch-lift-move. arXiv preprint arXiv:2509.14342, 2025

  34. [34]

    B. Pandit, A. Gupta, M. S. Gadde, A. Johnson, A. K. Shrestha, H. Duan, J. Dao, and A. Fern. Learning decentralized multi-biped control for payload transport. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=vhGkyWgctu

  35. [35]

    J. Zeng, A. M. Gimenez, E. Vinitsky, J. Alonso-Mora, and S. Sun. Decentralized aerial manipulation of a cable-suspended load using multi-agent reinforcement learning. In 2nd Workshop on Safe and Robust Robot Learning for Operation in the Real World, 2025. URL https://openreview.net/forum?id=yYYmqMv7Al

  36. [36]

    K. Shibata, R. Sota, S. D. Bosch, Y. Kadokawa, T. Yoshihisa, and T. Matsubara. Dereco: Decoupling representation and coordination learning for object-adaptive decentralized multi-robot cooperative transport. arXiv preprint arXiv:2603.08111, 2026

  37. [37]

    S. Chen, Z.-a. Cao, Z. Luo, F. Castañeda, C. Li, T. Wang, Y. Yuan, L. Fan, C. K. Liu, Y. Zhu, et al. Chip: Adaptive compliance for humanoid control through hindsight perturbation. arXiv preprint arXiv:2512.14689, 2025

  38. [38]

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, brian ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

  39. [39]

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V. Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. Sanketi, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y. Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y. Kuang, D. Kalashnikov, R. Jul...

  40. [40]

    L. Heng, Y. Tang, J. Xu, H. Bao, D. Huang, and Y. Wang. Humdex: Humanoid dexterous manipulation made easy. arXiv preprint arXiv:2603.12260, 2026

  41. [41]

    J. Fan, Z. Zhao, Y. Zhang, C. Chen, P. Wang, H. Zhang, and Z. Cheng. Robopaint: From human demonstration to any robot and any view. arXiv preprint arXiv:2602.05325, 2026

  42. [42]

    M. Shi, S. Peng, J. Chen, H. Jiang, Y. Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

  43. [43]

    R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y. Hu, Y. Hu, T. Zhang, C. Wen, et al. Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations. arXiv preprint arXiv:2602.06643, 2026

  44. [44]

    H. Chen, W. Zhang, P. Li, S. Ma, K. Ma, Y. Jin, Z. Xu, X. Wang, Y. Zheng, Z. Wang, et al. Rhythm: Learning interactive whole-body control for dual humanoids. arXiv preprint arXiv:2603.02856, 2026

  45. [45]

    W.-J. Huang, Y.-Y. Zhang, Y.-L. Wei, Z.-W. Xia, J. Tan, Y.-M. Li, Z. Zhao, and W.-S. Zheng. Learning whole-body human-humanoid interaction from human-human demonstrations. arXiv preprint arXiv:2601.09518, 2026

  46. [46]

    J. Mao, S. Zhao, S. Song, C. Hong, T. Shi, J. Ye, M. Zhang, H. Geng, J. Malik, V. Guizilini, and Y. Wang. Universal humanoid robot pose learning from internet human videos. In 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids), pages 1–8, 2025. doi:10.1109/Humanoids65713.2025.11203143

  47. [47]

    J. P. Araujo, Y. Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025

  48. [48]

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

  49. [49]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  50. [50]

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015

  51. [51]

    R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/huggingface/lerobot, 2024