pith. machine review for the scientific record.

arxiv: 2604.20348 · v1 · submitted 2026-04-22 · 💻 cs.RO · cs.AI · cs.MA

Recognition: unknown

Bimanual Robot Manipulation via Multi-Agent In-Context Learning


Pith reviewed 2026-05-10 00:22 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.MA
keywords bimanual manipulation · in-context learning · large language models · multi-agent systems · robot control · few-shot learning · coordinated actions

The pith

A multi-agent leader-follower debate lets off-the-shelf LLMs control two robot arms in coordinated tasks without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BiCICLe to let text-only LLMs perform bimanual robot tasks using in-context learning. It decouples the two arms into a leader and follower that condition each other sequentially, then uses iterative debate between two LLMs and a judge to refine the plan. This achieves high success rates on benchmark tasks and generalizes to new ones with few examples, suggesting that LLMs can handle complex coordination through structured prompting rather than retraining.

Core claim

BiCICLe is the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning by framing bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential conditioned single-arm predictions, and extending this with Arms' Debate for iterative refinement plus a third LLM-as-Judge to select plausible coordinated trajectories.

What carries the argument

The leader-follower decoupling of bimanual actions into sequential, conditioned single-arm predictions, extended by the iterative Arms' Debate and an LLM-as-Judge.
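To make the decoupling concrete, here is a minimal sketch of the sequential conditioning, assuming a generic text-completion call (`llm_complete`), a hypothetical demonstration format, and the right arm as leader; none of these names or prompt layouts come from the paper.

```python
# Minimal sketch of leader-follower decoupling: the leader arm's full
# trajectory is predicted first, then the follower is conditioned on it.
# `llm_complete` and the prompt layout are illustrative stand-ins.

def llm_complete(prompt: str) -> str:
    # Stand-in for an off-the-shelf, text-only LLM call.
    return "<predicted action sequence>"

def serialize_demos(demos: list[dict]) -> str:
    # Few-shot bimanual demonstrations rendered as text (assumed format).
    return "\n\n".join(
        f"State: {d['state']}\nRight arm: {d['right']}\nLeft arm: {d['left']}"
        for d in demos
    )

def predict_bimanual(demos: list[dict], state: str) -> tuple[str, str]:
    context = serialize_demos(demos)
    # Leader (right arm) predicts its full trajectory first...
    leader_plan = llm_complete(
        f"{context}\n\nState: {state}\nPredict the right-arm (leader) trajectory:"
    )
    # ...then the follower conditions on the leader's plan.
    follower_plan = llm_complete(
        f"{context}\n\nState: {state}\nLeader plan: {leader_plan}\n"
        "Predict a left-arm (follower) trajectory consistent with the leader:"
    )
    return leader_plan, follower_plan
```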

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same leader-follower structure with debate could be tested on multi-robot teams beyond two arms.
  • Adding vision inputs to the context might allow the method to handle more dynamic environments without changing the core prompting approach.
  • The performance gap over training-free baselines suggests that explicit coordination mechanisms can substitute for joint training data in embodied control.

Load-bearing premise

That framing bimanual control as sequential conditioned single-arm predictions plus iterative LLM debate sufficiently captures tight inter-arm coordination constraints without losing critical joint information.

What would settle it

Running BiCICLe on TWIN benchmark tasks that require highly synchronized, simultaneous movements of both arms and measuring whether success rates drop below the reported 71.1% average.
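As a sketch of that settling experiment, assuming a hypothetical `run_episode` hook into the benchmark harness and illustrative task names (neither comes from the paper):

```python
# Sketch of the settling test: average success rate over repeated trials
# on tightly synchronized tasks, compared against the reported 71.1%
# overall average. `run_episode` is a hypothetical harness hook.

from statistics import mean

def run_episode(task: str, seed: int) -> bool:
    # Placeholder: execute one BiCICLe episode on a TWIN task,
    # returning True on success.
    raise NotImplementedError

def average_success(tasks: list[str], trials: int = 10) -> float:
    per_task = {
        task: mean(run_episode(task, seed) for seed in range(trials))
        for task in tasks
    }
    return 100.0 * mean(per_task.values())

# Illustrative task ids, not the benchmark's actual names:
sync_tasks = ["lift_ball_synchronized", "dual_arm_handover"]
# The load-bearing premise survives if this stays near the reported average:
# average_success(sync_tasks) >= 71.1
```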

Figures

Figures reproduced from arXiv: 2604.20348 by Alessio Palma, Fabio Galasso, Georgia Chalvatzaki, Indro Spinelli, Luca Scofano, Vignesh Prasad, Yufeng Jin.

Figure 1
Figure 1. Overview of the BiCICLe Framework. (Left) Bimanual demonstrations are serialized into textual sequences of state observations and actions to construct the in-context prompt. (Right) During inference, a leader-follower decomposition enforces inter-arm coordination: the Leader agent predicts its full trajectory first; the Follower agent then predicts its actions conditioned on the Leader's plan. No task-spec…
Figure 2
Figure 2. Bimanual action prediction architectures.
Figure 3
Figure 3. New generalization tasks. Two bimanual tasks designed outside the TWIN benchmark. (Top) Close Jar. (Bottom) Take Item Out of Box. Generalization to New Tasks. A key advantage of ICL over supervised methods is the ability to generalize to novel tasks without retraining, as providing a few demonstrations at test time is sufficient. To evaluate this, we design two new bimanual tasks…
Figure 4
Figure 4. Qualitative comparison. Each pair of rows contrasts a successful BiCICLe episode (✓) with a failed baseline episode (✗) on the same task. (Top two rows) Lift Ball (tightly coupled symmetric): BiCICLe vs. RoboPrompt-SA. (Bottom two rows) Tray Oven (loosely coupled): BiCICLe vs. RoboPrompt-DA. Columns show four keyframes sampled along the trajectory (left to right: initial approach, contact, manipulation, o…
Figure 5
Figure 5. Real-world task executions. Top row: Lift Box task. Bottom row: Open Pot task. Both tasks are completed successfully by BiCICLe deployed on a physical bimanual Franka Panda system. These results demonstrate that our approach can be deployed on real world systems capable of executing physically demanding bimanual tasks, including those requiring tight inter-arm coordination and fine manipulation, with a low…
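Figure 1's left panel describes serializing bimanual demonstrations into textual state-action sequences for the in-context prompt. A minimal sketch of what such a serializer might look like, with field names and the per-arm action layout assumed purely for illustration:

```python
# Sketch of the demonstration serialization Figure 1 depicts: each
# timestep's observation and both arms' actions become one text line.
# The field names and [x, y, z, gripper] layout are assumptions.

def serialize_step(t: int, obs: dict, action: dict) -> str:
    # One timestep: observed object poses plus the two arms' actions.
    objects = ", ".join(f"{name}: {pose}" for name, pose in obs.items())
    return (
        f"t={t} | observe [{objects}] | "
        f"right -> {action['right']} | left -> {action['left']}"
    )

def serialize_demo(demo: list[tuple[dict, dict]]) -> str:
    return "\n".join(
        serialize_step(t, obs, act) for t, (obs, act) in enumerate(demo)
    )

# Hypothetical two-step lifting demo:
demo = [
    ({"ball": [50, 50, 10]}, {"right": [60, 50, 10, 1], "left": [40, 50, 10, 1]}),
    ({"ball": [50, 50, 30]}, {"right": [60, 50, 30, 0], "left": [40, 50, 30, 0]}),
]
print(serialize_demo(demo))
```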
read the original abstract

Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. This naturally extends to Arms' Debate, an iterative refinement process, and to the introduction of a third LLM-as-Judge to evaluate and select the most plausible coordinated trajectories. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods. We further demonstrate strong few-shot generalization on novel tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BiCICLe (Bimanual Coordinated In-Context Learning), a framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning by framing it as a multi-agent leader-follower problem with sequential conditioned single-arm predictions, an iterative Arms' Debate refinement process, and an LLM-as-Judge for trajectory selection. Evaluated on 13 tasks from the TWIN benchmark, it claims up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods, while demonstrating strong few-shot generalization on novel tasks.

Significance. If the results hold, this work would be significant for the field of robot learning and embodied AI, as it provides a training-free approach to complex bimanual tasks using off-the-shelf LLMs, potentially improving generalization and reducing data requirements compared to supervised methods. The multi-agent debate mechanism is a novel way to handle coordination in high-dimensional action spaces.

major comments (2)
  1. [Experimental Evaluation] The abstract reports concrete benchmark numbers (71.1% success rate, 6.7 pp improvement) but provides no details on exact baselines, task definitions, statistical significance, number of trials, or failure modes. The full experimental section is required to assess whether results support the central claim that the framework sufficiently captures inter-arm coordination.
  2. [Method (BiCICLe framework)] The claim that decoupling bimanual control into sequential leader-follower single-arm predictions plus iterative LLM debate captures tight coordination constraints is load-bearing but not sufficiently justified. For tasks with high simultaneity (e.g., object handoff or synchronized lifting), the sequential nature and text-based conditioning may lose critical joint-state information, potentially making reported gains artifactual.
minor comments (2)
  1. [Abstract] The abstract reports 'up to 71.1%'; it should state explicitly whether this figure is the average across tasks or the peak of the best configuration.
  2. [Notation] The terms 'Arms' Debate' and 'LLM-as-Judge' are introduced without prior definition in the abstract; ensure they are clearly defined early in the paper.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the work's potential significance. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Experimental Evaluation] The abstract reports concrete benchmark numbers (71.1% success rate, 6.7 pp improvement) but provides no details on exact baselines, task definitions, statistical significance, number of trials, or failure modes. The full experimental section is required to assess whether results support the central claim that the framework sufficiently captures inter-arm coordination.

    Authors: We agree that abstracts are concise by nature and that full details are essential for evaluating the coordination claim. The manuscript's Experimental Evaluation section (Section 4) specifies the exact baselines (with the strongest training-free baseline at 64.4%), the 13 TWIN benchmark task definitions, 10 independent trials per task for computing success rates and variability, and qualitative analysis of failure modes including coordination failures. To improve accessibility, we will add a short reference in the abstract to these experimental details and expand the failure mode discussion with a summary table. This revision ensures the results' support for inter-arm coordination is fully transparent. revision: partial

  2. Referee: [Method (BiCICLe framework)] The claim that decoupling bimanual control into sequential leader-follower single-arm predictions plus iterative LLM debate captures tight coordination constraints is load-bearing but not sufficiently justified. For tasks with high simultaneity (e.g., object handoff or synchronized lifting), the sequential nature and text-based conditioning may lose critical joint-state information, potentially making reported gains artifactual.

    Authors: We acknowledge the importance of rigorously justifying the leader-follower decoupling and debate mechanism. The Method section details how the follower arm conditions its prediction directly on the leader's generated action sequence and shared textual state, while Arms' Debate performs iterative critique and refinement to resolve inter-arm dependencies before the LLM judge selects the trajectory. For high-simultaneity tasks, the text-based conditioning and multi-turn debate allow the LLM to reason about timing and joint constraints. To further substantiate this and rule out artifactual gains, we will add a new subsection with ablations on debate iterations for simultaneous tasks (e.g., handoffs) and qualitative trajectory examples showing how coordination is recovered. This addresses the concern directly. revision: yes
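As a rough sketch, under assumptions, of the debate-and-judge loop this response describes: iterative critique and refinement of both arms' plans, then selection among candidate coordinated trajectories. The callable and all prompts below are illustrative, not the paper's actual templates.

```python
# Sketch of Arms' Debate plus LLM-as-Judge. All prompts are illustrative.

from typing import Callable

LLM = Callable[[str], str]  # any off-the-shelf, text-only model

def arms_debate(llm: LLM, context: str, leader: str, follower: str,
                rounds: int = 2) -> list[tuple[str, str]]:
    # Iteratively critique and revise both plans, keeping each round's
    # result as a candidate coordinated trajectory.
    candidates = [(leader, follower)]
    for _ in range(rounds):
        critique = llm(
            f"{context}\nLeader plan: {leader}\nFollower plan: {follower}\n"
            "Critique the timing and inter-arm coordination of these plans:"
        )
        leader = llm(f"{context}\nCritique: {critique}\nRevise the leader plan:")
        follower = llm(
            f"{context}\nRevised leader plan: {leader}\nCritique: {critique}\n"
            "Revise the follower plan to stay coordinated with the leader:"
        )
        candidates.append((leader, follower))
    return candidates

def judge(llm: LLM, context: str,
          candidates: list[tuple[str, str]]) -> tuple[str, str]:
    # A third LLM scores the candidates and picks one by index.
    listing = "\n".join(
        f"[{i}] leader: {l} | follower: {f}"
        for i, (l, f) in enumerate(candidates)
    )
    choice = llm(
        f"{context}\nCandidate coordinated trajectories:\n{listing}\n"
        "Reply with only the index of the most plausible candidate:"
    )
    try:
        index = int(choice.strip())
    except ValueError:
        index = 0  # fall back to the initial plan on a malformed reply
    return candidates[max(0, min(index, len(candidates) - 1))]
```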

Circularity Check

0 steps flagged

No circularity: the framework is a prompting structure evaluated on an external benchmark

full rationale

The paper presents BiCICLe as a new multi-agent prompting framework that decouples bimanual actions into sequential leader-follower predictions plus debate and judging, then reports success rates on the external TWIN benchmark. No equations, fitted parameters, or derivations are present that reduce to the inputs by construction. The central claim is an empirical evaluation of a novel structure rather than a self-referential mathematical result, and no load-bearing self-citations or ansatzes are invoked to justify the core method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the untested premise that standard LLMs possess sufficient in-context coordination reasoning when actions are decoupled sequentially.

axioms (1)
  • domain assumption: Standard text-only LLMs can perform reliable sequential action prediction and iterative refinement for physical robot coordination when given few-shot examples.
    Invoked throughout the description of BiCICLe, Arms' Debate, and LLM-as-Judge.
invented entities (3)
  • BiCICLe framework · no independent evidence
    purpose: decouple the bimanual action space into leader-follower single-arm predictions
    Core new structure introduced to fit ICL to bimanual control.
  • Arms' Debate · no independent evidence
    purpose: iterative refinement of coordinated trajectories
    New process added to improve plausibility of multi-arm plans.
  • LLM-as-Judge · no independent evidence
    purpose: evaluate and select the most plausible coordinated trajectories
    Third LLM component introduced to break ties and improve selection.

pith-pipeline@v0.9.0 · 5542 in / 1316 out tokens · 54263 ms · 2026-05-10T00:22:38.743296+00:00 · methodology

