Duet: Dual-Robot Understanding via Efficient Teaching

Basile Van Hoorick; Celina Shiyu Wang; Gaurav S. Sukhatme; Junjie Ye; Jyotirmoy V. Deshmukh; Leonidas Guibas; Minhao Li; Muchen Xu; Ruohai Ge; Sergey Zakharov

arxiv: 2606.20990 · v1 · pith:NALKU4IAnew · submitted 2026-06-18 · 💻 cs.RO

Duet: Dual-Robot Understanding via Efficient Teaching

Yiqi Zhao , Ruohai Ge , Celina Shiyu Wang , Junjie Ye , Muchen Xu , Minhao Li , Sergey Zakharov , Basile Van Hoorick

show 5 more authors

Vitor Campagnolo Guizilini Leonidas Guibas Gaurav S. Sukhatme Jyotirmoy V. Deshmukh Yue Wang

This is my paper

Pith reviewed 2026-06-26 16:44 UTC · model grok-4.3

classification 💻 cs.RO

keywords dual-robot collaborationVR teleoperationhuman priorsaction chunking transformermobile manipulationcollaborative tasksdata efficiencyheterogeneous robots

0 comments

The pith

Pretraining dual-robot policies on human coordination data from VR achieves equal or better task performance than robot-only training while cutting human effort by 5.4 times on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DUET, a framework that collects human-human coordination demonstrations through a synchronized VR teleoperation system to capture collaborative priors for dual-robot mobile manipulation. These priors pretrain an Action Chunking Transformer policy, which is then fine-tuned on a small set of real-robot teleoperation trajectories collected from heterogeneous embodiments like a humanoid and mobile manipulator. The approach targets the data bottleneck in training robots for tasks such as joint object transport and handovers that exceed single-robot capabilities. If the transfer works, it shows that human priors can substitute for much of the expensive robot-specific data collection without sacrificing, and sometimes improving, final performance. Results on a four-task benchmark confirm the method matches or exceeds baselines trained only on robot data.

Core claim

DUET uses a unified VR-based teleoperation pipeline to record human-human coordination and collaborative mobile manipulation priors, then pretrains collaborative policies with an Action Chunking Transformer on these efficient human demonstrations before fine-tuning on a minimal set of real-robot trajectories. On a benchmark of four collaborative tasks with a Unitree G1 humanoid and Dexmate Vega1 mobile manipulator, policies trained this way yield superior or comparable task performance to robot-only baselines while the human data collection pipeline accelerates data gathering by 5.4 times on average relative to direct teleoperation.

What carries the argument

Action Chunking Transformer that first pretrains collaborative policies on human-human VR demonstrations before fine-tuning on limited real-robot teleoperation trajectories.

If this is right

Dual-robot policies for tasks exceeding single-robot reach can be trained with substantially less direct robot teleoperation time.
Human coordination priors captured in VR improve or maintain performance on coordinated transport and handover tasks after fine-tuning.
The synchronized VR collection system enables faster acquisition of in-domain data for heterogeneous robot pairs.
Fine-tuning after human pretraining sufficiently bridges embodiment differences between humans and the tested robot platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pretraining-then-fine-tune pattern might reduce data needs for other multi-robot coordination problems beyond the dual case tested here.
If VR hardware is unavailable, alternative motion-capture methods could be substituted but would require separate validation of transfer quality.
Increasing the volume or diversity of human-human demonstrations during pretraining could further shrink the required robot fine-tuning set.
Real-world deployment would still need checks for safety when human-derived coordination patterns interact with physical robot constraints not present in VR.

Load-bearing premise

Human-human coordination data collected in VR transfers to heterogeneous robot embodiments with only minimal fine-tuning and little domain gap.

What would settle it

Train identical Action Chunking Transformer policies on the same four tasks using only the real-robot teleoperation data without the human pretraining stage and compare success rates and completion times to the full DUET pipeline; equal or higher performance in the robot-only case would falsify the value of the human priors.

Figures

Figures reproduced from arXiv: 2606.20990 by Basile Van Hoorick, Celina Shiyu Wang, Gaurav S. Sukhatme, Junjie Ye, Jyotirmoy V. Deshmukh, Leonidas Guibas, Minhao Li, Muchen Xu, Ruohai Ge, Sergey Zakharov, Vitor Campagnolo Guizilini, Yiqi Zhao, Yue Wang.

**Figure 1.** Figure 1: DUET. We introduce a dual-robot policy learning framework that features efficient learning from human demonstrations. Left: Two humans perform a collaborative manipulation task. Right: A heterogeneous dual-robot system mimics the human teachers. Abstract: Dual-robot collaboration enables tasks that exceed the reach and payload of a single robot, such as collaboratively transporting objects across enviro… view at source ↗

**Figure 2.** Figure 2: Dual-Robot Teleoperation Pipeline. Two human operators utilize PICO VR interfaces to simultaneously control a heterogeneous robot duo, Vega1 via General Motion Retargeting [46] and G1 via the SONIC [20] framework. The system streams real-time visual feedback while asynchronously logging joint pose data. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: DUET Overview. In stage 1, we pretrain an ACT architecture ( [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Model Architecture. Our ACT policy leverages a 2-head design with the joint head randomly initialized during finetune. Our robots are controllable via joint-space commands. Our architecture ( [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Benchmark data distribution and collection efficiency. Left: Total data span and number of episodes per pipeline across four tasks. Right: Amortized collection time (AT) for each task. The value above each task group is the speedup ratio, defined as the teleoperation AT divided by the human-demonstration AT, where AT is the average time for each successful data collection. to answer the following research … view at source ↗

**Figure 6.** Figure 6: Overview of the Tasks. Snapshots show the initial setup and phases. 4.3 Robot-only Ablation Study To address Q2, we conduct an ablation study where only Stage 2 from Section 3.3, training with 50 robot-only data for each task, is carried out on a fresh ACT (implemented through [50]) without human prior. To quantify multi-stage performance, we implement a normalized metric scoring each task trial from 0.0 t… view at source ↗

**Figure 7.** Figure 7: Evaluation Results. Comparison of task success rates and accumulated points over 10 experiments for each task of T1–T4. DUET (60 human demos + 30 robot trajectories) is compared against robot-only baselines (trained on 30 and 50 trajectories). demonstrating that our teleoperation pipeline yields a capable standalone baseline for multi-robot learning. 4.4 DUET Performance To address Q3, we execute Stage 1 (… view at source ↗

**Figure 8.** Figure 8: Teleoperation System Example. A side-by-side of the human teleoperator and the synchronized dual-robot execution. The sequence demonstrates the G1 humanoid walking to the workspace to grasp the target object (a toy dog), followed by a coordinated physical handover to the Vega1 mobile manipulator, which subsequently pivots and places the object onto a secondary table. B.2 Mesh Extraction From each clip’s al… view at source ↗

**Figure 9.** Figure 9: Human Data Collection Example. B.4 Human Data Collection Example As shown in [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Zero-Shot Generalizability to OOD Conditions. We evaluate DUET under visual and physical distribution shifts. In T2, the standard boxes are replaced with a black foam box and a green-and-black box, altering visual cues and physical dynamics. In T3, the white board is entirely covered by a black cover to test resilience to drastic background variations. These changes are illustrated in the figure. Despite … view at source ↗

read the original abstract

Dual-robot collaboration enables tasks that exceed the reach and payload of a single robot, such as collaboratively transporting objects across environments and executing coordinated handovers. Data acquisition is the primary bottleneck for training these systems. To this end, we introduce DUET, a dual-robot learning framework for mobile manipulation. For efficient data collection, we create a unified dual-embodiment synchronized VR-based teleoperation system for in-domain heterogeneous robot data collection. We further develop a complementary tracking pipeline that records human-human coordination and collaborative mobile manipulation priors. To allow efficient learning, we introduce an Action Chunking Transformer based architecture that first pretrains collaborative policies on efficient human-human demonstrations, before finetuning them on a minimal set of real-robot teleoperation trajectories. We develop a benchmark of four collaborative tasks to evaluate our framework using a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator. The results demonstrate that harnessing human priors not only yields superior task performance compared to baselines trained only on robot data, but also reduces the total human effort required for data collection. Our human data collection pipeline achieves 5.4x acceleration on average from teleoperation, but we perform equally or better than robot-only data trained policies across all tasks. Our project page is available at https://zhaoy37.github.io/Duet/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DUET builds a VR pipeline to harvest human coordination data for pretraining dual-robot policies and reports efficiency gains, but the transfer across heterogeneous embodiments is the claim that needs the most scrutiny.

read the letter

The main takeaway is that this paper gives a concrete way to collect human-human coordination demonstrations in VR for two different robot platforms, pretrain an ACT policy on that data, and then fine-tune on a small amount of actual robot teleoperation. They claim this cuts human effort by 5.4x on average while matching or exceeding policies trained only on robot data across four collaborative tasks with a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator.

What they actually did is put together a synchronized dual-embodiment VR teleop system and a complementary human tracking pipeline to capture those priors. The pretrain-then-fine-tune approach is the practical piece that targets the data bottleneck in dual-robot mobile manipulation. The benchmark itself is a useful addition for the community.

The results on the four tasks are presented as evidence that the human priors help. If the numbers and collection times hold up in the full experiments, that part is straightforward to evaluate.

The softer spot is the transfer assumption. Human-human VR data has to cross over to robot-robot coordination on kinematically different platforms, and the paper treats the remaining domain gap as small after minimal fine-tuning. Without clear ablations that separate the pretraining contribution from data volume or architecture choices, or quantitative measures of how much coordination knowledge actually carries over, it is hard to tell how robust the efficiency claim is. If the gap is larger than stated, both the performance edge and the 5.4x figure would shrink.

This is for people working on imitation learning for multi-robot systems who need practical data collection methods. A reader already thinking about human priors or VR teleop will get concrete details and a benchmark to compare against.

I would send it to peer review. The system is described enough to be checked, the problem is real, and the experiments can be assessed directly even if revisions are needed on the transfer analysis.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DUET, a dual-robot learning framework for mobile manipulation that collects human-human coordination priors via a unified VR-based synchronized teleoperation and tracking pipeline, pretrains an Action Chunking Transformer policy on these demonstrations, and fine-tunes on a minimal set of real-robot trajectories. It evaluates the approach on four collaborative tasks using a Unitree G1 humanoid and Dexmate Vega1 mobile manipulator, claiming superior task performance over robot-only baselines together with a 5.4x average acceleration in human data collection effort relative to teleoperation.

Significance. If the transfer from human-human VR priors to heterogeneous robot embodiments holds with the reported performance gains, the framework would meaningfully reduce the data-collection bottleneck for dual-robot collaborative systems by substituting expensive robot teleoperation with faster human-human demonstrations.

major comments (1)

[Abstract] Abstract: the claim that human-human VR coordination data supplies priors that transfer to the Unitree G1 + Dexmate Vega1 pair with minimal domain gap after fine-tuning on only a minimal robot-teleop set is load-bearing for both the performance superiority and 5.4x effort-reduction assertions, yet the manuscript provides no quantitative transfer metrics, embodiment-mapping details, or ablation isolating the pretraining contribution versus architecture or data volume.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern below and will revise accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that human-human VR coordination data supplies priors that transfer to the Unitree G1 + Dexmate Vega1 pair with minimal domain gap after fine-tuning on only a minimal robot-teleop set is load-bearing for both the performance superiority and 5.4x effort-reduction assertions, yet the manuscript provides no quantitative transfer metrics, embodiment-mapping details, or ablation isolating the pretraining contribution versus architecture or data volume.

Authors: We agree the abstract would benefit from greater specificity. The reported performance gains are measured against robot-only baselines trained on equivalent or larger volumes of real-robot data, and the 5.4x figure is a direct measurement of human collection time (VR human-human vs. robot teleoperation). To isolate the pretraining contribution we will add (i) an ablation comparing the same ACT architecture trained from scratch on the minimal robot set versus pretrained on human-human data then fine-tuned, (ii) quantitative transfer metrics such as success-rate deltas and sample-efficiency curves attributable to the human priors, and (iii) a concise description of the embodiment mapping realized by the unified VR teleoperation pipeline. These additions will be placed in the experiments section and referenced from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on benchmark comparisons

full rationale

The manuscript presents a robotics framework for dual-robot collaboration via VR human-human pretraining followed by robot fine-tuning, evaluated on four tasks with a Unitree G1 and Dexmate Vega1. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance superiority and 5.4x effort reduction are asserted via direct experimental comparison to robot-only baselines, not by construction from the inputs themselves. The central transfer assumption is an empirical claim open to falsification rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Framework depends on transferability of human coordination data to robots; no free parameters or invented entities visible in abstract.

axioms (1)

domain assumption Human-human coordination priors collected via VR transfer effectively to heterogeneous robot-robot collaboration after fine-tuning
This is the core premise enabling the pretraining step and performance claims.

pith-pipeline@v0.9.1-grok · 5825 in / 1100 out tokens · 23597 ms · 2026-06-26T16:44:02.175157+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 9 canonical work pages

[1]

J. Fink, M. A. Hsieh, and V . Kumar. Multi-robot manipulation via caging in environments with obstacles. In2008 IEEE International Conference on Robotics and Automation, pages 1471–1476. IEEE, 2008

2008
[2]

M. Lai, K. Go, Z. Li, T. Kr ¨oger, S. Schaal, K. Allen, and J. Scholz. Roboballet: Planning for multirobot reaching with graph neural networks and reinforcement learning.Science Robotics, 10(106):eads1204, 2025

2025
[3]

Wang and M

Z. Wang and M. Schwager. Multi-robot manipulation without communication. InDistributed autonomous robotic systems: The 12th international symposium, pages 135–149. Springer, 2016

2016
[4]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=FO6tePGRZj

2024
[5]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=ZMnD6QZAE6

2024
[6]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026
[7]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

2025
[8]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. sivapurapu, and J. Zhang. Egodex: Learning dexter- ous manipulation from large-scale egocentric video. InThe F ourteenth International Con- ference on Learning Representations, 2026. URLhttps://openreview.net/forum?id= FFxkFMU89E. 9

2026
[9]

Punamiya, S

R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Li- conti, L. Y . Zhu, et al. Egoverse: An egocentric human dataset for robot learning from around the world.arXiv preprint arXiv:2604.07607, 2026

Pith/arXiv arXiv 2026
[10]

X. Yang, D. Kukreja, D. Pinkus, A. Sagar, T. Fan, J. Park, S. Shin, J. Cao, J. Liu, N. Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989, 2026

arXiv 2026
[11]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023
[12]

T. Z. Zhao, J. Tompson, D. Driess, P. Florence, S. K. S. Ghasemipour, C. Finn, and A. Wahid. ALOHA unleashed: A simple recipe for robot dexterity. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=gvdXE7ikHI

2024
[13]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

work page doi:10.15607/rss.2024.xx.120 2024
[14]

S. Dass, W. Ai, Y . Jiang, S. Singh, J. Hu, R. Zhang, P. Stone, B. Abbatematteo, and R. Mart´ın- Mart´ın. Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869, 2024

arXiv 2024
[15]

Jiang, R

Y . Jiang, R. Zhang, J. Wong, C. Wang, Y . Ze, H. Yin, C. Gokmen, S. Song, J. Wu, and L. Fei- Fei. BEHA VIOR robot suite: Streamlining real-world whole-body manipulation for everyday household activities. In9th Annual Conference on Robot Learning, 2025. URLhttps:// openreview.net/forum?id=v2KevjWScT

2025
[16]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Uni- versal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. InProceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi: 10.15607/RSS.2024.XX.045

work page doi:10.15607/rss.2024.xx.045 2024
[17]

H. Choi, Y . Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song. In-the-wild compliant manipulation with umi-ft, 2026. URLhttps://arxiv.org/abs/2601.09988

arXiv 2026
[18]

Shirwatkar, N

H. Etukuru, N. Naka, Z. Hu, S. Lee, J. Mehu, A. Edsinger, C. Paxton, S. Chintala, L. Pinto, and N. M. Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8275–8283, 2025. doi:10.1109/ICRA55743.2025.11127857

work page doi:10.1109/icra55743.2025.11127857 2025
[19]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.Ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026. 10

arXiv 2026
[20]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025
[21]

J. Li, X. Cheng, T. Huang, S. Yang, R.-Z. Qiu, and X. Wang. AMO: Adaptive Motion Opti- mization for Hyper-Dexterous Humanoid Whole-Body Control. InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.061

work page doi:10.15607/rss.2025.xxi.061 2025
[22]

Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang. CLONE: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In9th Annual Conference on Robot Learning,
[23]

URLhttps://openreview.net/forum?id=Bw9NHYjDqR
[24]

Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

arXiv 2025
[25]

Mattson, V

C. Mattson, V . Raveendra, E. Novoseller, N. Waytowich, V . J. Lawhern, and D. S. Brown. R2bc: Multi-agent imitation learning from single-agent demonstrations.arXiv preprint arXiv:2510.18085, 2025

Pith/arXiv arXiv 2025
[26]

K. Song, S. Ma, G. Chen, N. Jin, G. Zhao, M. Ding, Z. Xiong, and J. Pan. Collabot: Vision- language guided simultaneous collaborative manipulation.arXiv preprint arXiv:2508.03526, 2025

Pith/arXiv arXiv 2025
[27]

Shirwatkar, N

W. Zhang, C. Street, and M. Mansouri. Multi-nonholonomic robot object transportation with obstacle crossing using a deformable sheet. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7349–7355, 2025. doi:10.1109/ICRA55743.2025. 11128313

work page doi:10.1109/icra55743.2025 2025
[28]

Michael, J

N. Michael, J. Fink, and V . Kumar. Cooperative manipulation and transportation with aerial robots. InProceedings of Robotics: Science and Systems, Seattle, USA, June 2009. doi: 10.15607/RSS.2009.V .001

work page doi:10.15607/rss.2009.v 2009
[29]

C. Yang, G. N. Sue, Z. Li, L. Yang, H. Shen, Y . Chi, A. Rai, J. Zeng, and K. Sreenath. Col- laborative navigation and manipulation of a cable-towed load by multiple quadrupedal robots. IEEE Robotics and Automation Letters, 7(4):10041–10048, 2022

2022
[30]

R. T. Fawcett, L. Amanzadeh, J. Kim, A. D. Ames, and K. A. Hamed. Distributed data- driven predictive control for multi-agent collaborative legged locomotion. In2023 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 9924–9930, 2023. doi: 10.1109/ICRA48891.2023.10160914

work page doi:10.1109/icra48891.2023.10160914 2023
[31]

Y . Wang, M. Damani, P. Wang, Y . Cao, and G. Sartoretti. Distributed reinforcement learning for robot teams: A review.Current Robotics Reports, 3(4):239–257, 2022

2022
[32]

Wu and C

B. Wu and C. S. Suh. State-of-the-art in robot learning for multi-robot collaboration: A com- prehensive survey.arXiv preprint arXiv:2408.11822, 2024

arXiv 2024
[33]

Pandit, A

B. Pandit, A. K. Shrestha, and A. Fern. Multi-quadruped cooperative object transport: Learning decentralized pinch-lift-move.arXiv preprint arXiv:2509.14342, 2025

arXiv 2025
[34]

Pandit, A

B. Pandit, A. Gupta, M. S. Gadde, A. Johnson, A. K. Shrestha, H. Duan, J. Dao, and A. Fern. Learning decentralized multi-biped control for payload transport. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=vhGkyWgctu

2024
[35]

J. Zeng, A. M. Gimenez, E. Vinitsky, J. Alonso-Mora, and S. Sun. Decentralized aerial manipulation of a cable-suspended load using multi-agent reinforcement learning. In2nd Workshop on Safe and Robust Robot Learning for Operation in the Real World, 2025. URL https://openreview.net/forum?id=yYYmqMv7Al. 11

2025
[36]

Shibata, R

K. Shibata, R. Sota, S. D. Bosch, Y . Kadokawa, T. Yoshihisa, and T. Matsubara. Dereco: Decoupling representation and coordination learning for object-adaptive decentralized multi- robot cooperative transport.arXiv preprint arXiv:2603.08111, 2026

arXiv 2026
[37]

Chen, Z.-a

S. Chen, Z.-a. Cao, Z. Luo, F. Casta ˜neda, C. Li, T. Wang, Y . Yuan, L. Fan, C. K. Liu, Y . Zhu, et al. Chip: Adaptive compliance for humanoid control through hindsight perturbation.arXiv preprint arXiv:2512.14689, 2025

arXiv 2025
[38]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, brian ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

2025
[39]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023
[40]

L. Heng, Y . Tang, J. Xu, H. Bao, D. Huang, and Y . Wang. Humdex: Humanoid dexterous manipulation made easy.arXiv preprint arXiv:2603.12260, 2026

arXiv 2026
[41]

J. Fan, Z. Zhao, Y . Zhang, C. Chen, P. Wang, H. Zhang, and Z. Cheng. Robopaint: From human demonstration to any robot and any view.arXiv preprint arXiv:2602.05325, 2026

arXiv 2026
[42]

M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

Pith/arXiv arXiv 2026
[43]

R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen, et al. Hu- manoid manipulation interface: Humanoid whole-body manipulation from robot-free demon- strations.arXiv preprint arXiv:2602.06643, 2026

arXiv 2026
[44]

H. Chen, W. Zhang, P. Li, S. Ma, K. Ma, Y . Jin, Z. Xu, X. Wang, Y . Zheng, Z. Wang, et al. Rhythm: Learning interactive whole-body control for dual humanoids.arXiv preprint arXiv:2603.02856, 2026

Pith/arXiv arXiv 2026
[45]

Huang, Y .-Y

W.-J. Huang, Y .-Y . Zhang, Y .-L. Wei, Z.-W. Xia, J. Tan, Y .-M. Li, Z. Zhao, and W.-S. Zheng. Learning whole-body human-humanoid interaction from human-human demonstrations.arXiv preprint arXiv:2601.09518, 2026

arXiv 2026
[46]

J. Mao, S. Zhao, S. Song, C. Hong, T. Shi, J. Ye, M. Zhang, H. Geng, J. Malik, V . Guizilini, and Y . Wang. Universal humanoid robot pose learning from internet human videos. In2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids), pages 1–8, 2025. doi:10.1109/Humanoids65713.2025.11203143

work page doi:10.1109/humanoids65713.2025.11203143 2025
[47]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 12

arXiv 2025
[48]

Redmon, S

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real- time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

2016
[49]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[50]

Russakovsky, J

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge.Interna- tional journal of computer vision, 115(3):211–252, 2015

2015
[51]

Cadene, S

R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Ar- actingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/huggingface/lerobot, 2024. 13 Contents 1 Introduction 2 2 Relate...

2024

[1] [1]

J. Fink, M. A. Hsieh, and V . Kumar. Multi-robot manipulation via caging in environments with obstacles. In2008 IEEE International Conference on Robotics and Automation, pages 1471–1476. IEEE, 2008

2008

[2] [2]

M. Lai, K. Go, Z. Li, T. Kr ¨oger, S. Schaal, K. Allen, and J. Scholz. Roboballet: Planning for multirobot reaching with graph neural networks and reinforcement learning.Science Robotics, 10(106):eads1204, 2025

2025

[3] [3]

Wang and M

Z. Wang and M. Schwager. Multi-robot manipulation without communication. InDistributed autonomous robotic systems: The 12th international symposium, pages 135–149. Springer, 2016

2016

[4] [4]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=FO6tePGRZj

2024

[5] [5]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=ZMnD6QZAE6

2024

[6] [6]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[7] [7]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

2025

[8] [8]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. sivapurapu, and J. Zhang. Egodex: Learning dexter- ous manipulation from large-scale egocentric video. InThe F ourteenth International Con- ference on Learning Representations, 2026. URLhttps://openreview.net/forum?id= FFxkFMU89E. 9

2026

[9] [9]

Punamiya, S

R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Li- conti, L. Y . Zhu, et al. Egoverse: An egocentric human dataset for robot learning from around the world.arXiv preprint arXiv:2604.07607, 2026

Pith/arXiv arXiv 2026

[10] [10]

X. Yang, D. Kukreja, D. Pinkus, A. Sagar, T. Fan, J. Park, S. Shin, J. Cao, J. Liu, N. Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989, 2026

arXiv 2026

[11] [11]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023

[12] [12]

T. Z. Zhao, J. Tompson, D. Driess, P. Florence, S. K. S. Ghasemipour, C. Finn, and A. Wahid. ALOHA unleashed: A simple recipe for robot dexterity. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=gvdXE7ikHI

2024

[13] [13]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

work page doi:10.15607/rss.2024.xx.120 2024

[14] [14]

S. Dass, W. Ai, Y . Jiang, S. Singh, J. Hu, R. Zhang, P. Stone, B. Abbatematteo, and R. Mart´ın- Mart´ın. Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869, 2024

arXiv 2024

[15] [15]

Jiang, R

Y . Jiang, R. Zhang, J. Wong, C. Wang, Y . Ze, H. Yin, C. Gokmen, S. Song, J. Wu, and L. Fei- Fei. BEHA VIOR robot suite: Streamlining real-world whole-body manipulation for everyday household activities. In9th Annual Conference on Robot Learning, 2025. URLhttps:// openreview.net/forum?id=v2KevjWScT

2025

[16] [16]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Uni- versal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. InProceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi: 10.15607/RSS.2024.XX.045

work page doi:10.15607/rss.2024.xx.045 2024

[17] [17]

H. Choi, Y . Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song. In-the-wild compliant manipulation with umi-ft, 2026. URLhttps://arxiv.org/abs/2601.09988

arXiv 2026

[18] [18]

Shirwatkar, N

H. Etukuru, N. Naka, Z. Hu, S. Lee, J. Mehu, A. Edsinger, C. Paxton, S. Chintala, L. Pinto, and N. M. Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8275–8283, 2025. doi:10.1109/ICRA55743.2025.11127857

work page doi:10.1109/icra55743.2025.11127857 2025

[19] [19]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.Ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026. 10

arXiv 2026

[20] [20]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025

[21] [21]

J. Li, X. Cheng, T. Huang, S. Yang, R.-Z. Qiu, and X. Wang. AMO: Adaptive Motion Opti- mization for Hyper-Dexterous Humanoid Whole-Body Control. InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.061

work page doi:10.15607/rss.2025.xxi.061 2025

[22] [22]

Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang. CLONE: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In9th Annual Conference on Robot Learning,

[23] [23]

URLhttps://openreview.net/forum?id=Bw9NHYjDqR

[24] [24]

Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

arXiv 2025

[25] [25]

Mattson, V

C. Mattson, V . Raveendra, E. Novoseller, N. Waytowich, V . J. Lawhern, and D. S. Brown. R2bc: Multi-agent imitation learning from single-agent demonstrations.arXiv preprint arXiv:2510.18085, 2025

Pith/arXiv arXiv 2025

[26] [26]

K. Song, S. Ma, G. Chen, N. Jin, G. Zhao, M. Ding, Z. Xiong, and J. Pan. Collabot: Vision- language guided simultaneous collaborative manipulation.arXiv preprint arXiv:2508.03526, 2025

Pith/arXiv arXiv 2025

[27] [27]

Shirwatkar, N

W. Zhang, C. Street, and M. Mansouri. Multi-nonholonomic robot object transportation with obstacle crossing using a deformable sheet. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7349–7355, 2025. doi:10.1109/ICRA55743.2025. 11128313

work page doi:10.1109/icra55743.2025 2025

[28] [28]

Michael, J

N. Michael, J. Fink, and V . Kumar. Cooperative manipulation and transportation with aerial robots. InProceedings of Robotics: Science and Systems, Seattle, USA, June 2009. doi: 10.15607/RSS.2009.V .001

work page doi:10.15607/rss.2009.v 2009

[29] [29]

C. Yang, G. N. Sue, Z. Li, L. Yang, H. Shen, Y . Chi, A. Rai, J. Zeng, and K. Sreenath. Col- laborative navigation and manipulation of a cable-towed load by multiple quadrupedal robots. IEEE Robotics and Automation Letters, 7(4):10041–10048, 2022

2022

[30] [30]

R. T. Fawcett, L. Amanzadeh, J. Kim, A. D. Ames, and K. A. Hamed. Distributed data- driven predictive control for multi-agent collaborative legged locomotion. In2023 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 9924–9930, 2023. doi: 10.1109/ICRA48891.2023.10160914

work page doi:10.1109/icra48891.2023.10160914 2023

[31] [31]

Y . Wang, M. Damani, P. Wang, Y . Cao, and G. Sartoretti. Distributed reinforcement learning for robot teams: A review.Current Robotics Reports, 3(4):239–257, 2022

2022

[32] [32]

Wu and C

B. Wu and C. S. Suh. State-of-the-art in robot learning for multi-robot collaboration: A com- prehensive survey.arXiv preprint arXiv:2408.11822, 2024

arXiv 2024

[33] [33]

Pandit, A

B. Pandit, A. K. Shrestha, and A. Fern. Multi-quadruped cooperative object transport: Learning decentralized pinch-lift-move.arXiv preprint arXiv:2509.14342, 2025

arXiv 2025

[34] [34]

Pandit, A

B. Pandit, A. Gupta, M. S. Gadde, A. Johnson, A. K. Shrestha, H. Duan, J. Dao, and A. Fern. Learning decentralized multi-biped control for payload transport. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=vhGkyWgctu

2024

[35] [35]

J. Zeng, A. M. Gimenez, E. Vinitsky, J. Alonso-Mora, and S. Sun. Decentralized aerial manipulation of a cable-suspended load using multi-agent reinforcement learning. In2nd Workshop on Safe and Robust Robot Learning for Operation in the Real World, 2025. URL https://openreview.net/forum?id=yYYmqMv7Al. 11

2025

[36] [36]

Shibata, R

K. Shibata, R. Sota, S. D. Bosch, Y . Kadokawa, T. Yoshihisa, and T. Matsubara. Dereco: Decoupling representation and coordination learning for object-adaptive decentralized multi- robot cooperative transport.arXiv preprint arXiv:2603.08111, 2026

arXiv 2026

[37] [37]

Chen, Z.-a

S. Chen, Z.-a. Cao, Z. Luo, F. Casta ˜neda, C. Li, T. Wang, Y . Yuan, L. Fan, C. K. Liu, Y . Zhu, et al. Chip: Adaptive compliance for humanoid control through hindsight perturbation.arXiv preprint arXiv:2512.14689, 2025

arXiv 2025

[38] [38]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, brian ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

2025

[39] [39]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023

[40] [40]

L. Heng, Y . Tang, J. Xu, H. Bao, D. Huang, and Y . Wang. Humdex: Humanoid dexterous manipulation made easy.arXiv preprint arXiv:2603.12260, 2026

arXiv 2026

[41] [41]

J. Fan, Z. Zhao, Y . Zhang, C. Chen, P. Wang, H. Zhang, and Z. Cheng. Robopaint: From human demonstration to any robot and any view.arXiv preprint arXiv:2602.05325, 2026

arXiv 2026

[42] [42]

M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

Pith/arXiv arXiv 2026

[43] [43]

R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen, et al. Hu- manoid manipulation interface: Humanoid whole-body manipulation from robot-free demon- strations.arXiv preprint arXiv:2602.06643, 2026

arXiv 2026

[44] [44]

H. Chen, W. Zhang, P. Li, S. Ma, K. Ma, Y . Jin, Z. Xu, X. Wang, Y . Zheng, Z. Wang, et al. Rhythm: Learning interactive whole-body control for dual humanoids.arXiv preprint arXiv:2603.02856, 2026

Pith/arXiv arXiv 2026

[45] [45]

Huang, Y .-Y

W.-J. Huang, Y .-Y . Zhang, Y .-L. Wei, Z.-W. Xia, J. Tan, Y .-M. Li, Z. Zhao, and W.-S. Zheng. Learning whole-body human-humanoid interaction from human-human demonstrations.arXiv preprint arXiv:2601.09518, 2026

arXiv 2026

[46] [46]

J. Mao, S. Zhao, S. Song, C. Hong, T. Shi, J. Ye, M. Zhang, H. Geng, J. Malik, V . Guizilini, and Y . Wang. Universal humanoid robot pose learning from internet human videos. In2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids), pages 1–8, 2025. doi:10.1109/Humanoids65713.2025.11203143

work page doi:10.1109/humanoids65713.2025.11203143 2025

[47] [47]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 12

arXiv 2025

[48] [48]

Redmon, S

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real- time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

2016

[49] [49]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016

[50] [50]

Russakovsky, J

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge.Interna- tional journal of computer vision, 115(3):211–252, 2015

2015

[51] [51]

Cadene, S

R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Ar- actingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/huggingface/lerobot, 2024. 13 Contents 1 Introduction 2 2 Relate...

2024