pith. machine review for the scientific record.

arxiv: 2604.20348 · v1 · submitted 2026-04-22 · 💻 cs.RO · cs.AI · cs.MA

Recognition: unknown

Bimanual Robot Manipulation via Multi-Agent In-Context Learning


Pith reviewed 2026-05-10 00:22 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.MA
keywords bimanual manipulation · in-context learning · large language models · multi-agent systems · robot control · few-shot learning · coordinated actions

The pith

A multi-agent leader-follower debate lets off-the-shelf LLMs control two robot arms in coordinated tasks without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BiCICLe to let text-only LLMs perform bimanual robot tasks using in-context learning. It decouples the two arms into a leader and follower that condition each other sequentially, then uses iterative debate between two LLMs and a judge to refine the plan. This achieves high success rates on benchmark tasks and generalizes to new ones with few examples, suggesting that LLMs can handle complex coordination through structured prompting rather than retraining.

Core claim

BiCICLe is the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning by framing bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential conditioned single-arm predictions, and extending this with Arms' Debate for iterative refinement plus a third LLM-as-Judge to select plausible coordinated trajectories.

What carries the argument

The leader-follower decoupling of bimanual actions into sequential, conditioned single-arm predictions, extended by the iterative Arms' Debate and an LLM-as-Judge.
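To make the decoupling concrete, here is a minimal sketch of the sequential conditioning, assuming a generic text-completion call (`llm_complete`), a hypothetical demonstration format, and the right arm as leader; none of these names or prompt layouts come from the paper.

```python
# Minimal sketch of leader-follower decoupling: the leader arm's full
# trajectory is predicted first, then the follower is conditioned on it.
# `llm_complete` and the prompt layout are illustrative stand-ins.

def llm_complete(prompt: str) -> str:
    # Stand-in for an off-the-shelf, text-only LLM call.
    return "<predicted action sequence>"

def serialize_demos(demos: list[dict]) -> str:
    # Few-shot bimanual demonstrations rendered as text (assumed format).
    return "\n\n".join(
        f"State: {d['state']}\nRight arm: {d['right']}\nLeft arm: {d['left']}"
        for d in demos
    )

def predict_bimanual(demos: list[dict], state: str) -> tuple[str, str]:
    context = serialize_demos(demos)
    # Leader (right arm) predicts its full trajectory first...
    leader_plan = llm_complete(
        f"{context}\n\nState: {state}\nPredict the right-arm (leader) trajectory:"
    )
    # ...then the follower conditions on the leader's plan.
    follower_plan = llm_complete(
        f"{context}\n\nState: {state}\nLeader plan: {leader_plan}\n"
        "Predict a left-arm (follower) trajectory consistent with the leader:"
    )
    return leader_plan, follower_plan
```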

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same leader-follower structure with debate could be tested on multi-robot teams beyond two arms.
  • Adding vision inputs to the context might allow the method to handle more dynamic environments without changing the core prompting approach.
  • The performance gap over training-free baselines suggests that explicit coordination mechanisms can substitute for joint training data in embodied control.

Load-bearing premise

That framing bimanual control as sequential conditioned single-arm predictions plus iterative LLM debate sufficiently captures tight inter-arm coordination constraints without losing critical joint information.

What would settle it

Running BiCICLe on TWIN benchmark tasks that require highly synchronized, simultaneous movements of both arms and measuring whether success rates drop below the reported 71.1% average.
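As a sketch of that settling experiment, assuming a hypothetical `run_episode` hook into the benchmark harness and illustrative task names (neither comes from the paper):

```python
# Sketch of the settling test: average success rate over repeated trials
# on tightly synchronized tasks, compared against the reported 71.1%
# overall average. `run_episode` is a hypothetical harness hook.

from statistics import mean

def run_episode(task: str, seed: int) -> bool:
    # Placeholder: execute one BiCICLe episode on a TWIN task,
    # returning True on success.
    raise NotImplementedError

def average_success(tasks: list[str], trials: int = 10) -> float:
    per_task = {
        task: mean(run_episode(task, seed) for seed in range(trials))
        for task in tasks
    }
    return 100.0 * mean(per_task.values())

# Illustrative task ids, not the benchmark's actual names:
sync_tasks = ["lift_ball_synchronized", "dual_arm_handover"]
# The load-bearing premise survives if this stays near the reported average:
# average_success(sync_tasks) >= 71.1
```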

Figures

Figures reproduced from arXiv: 2604.20348 by Alessio Palma, Fabio Galasso, Georgia Chalvatzaki, Indro Spinelli, Luca Scofano, Vignesh Prasad, Yufeng Jin.

Figure 1
Figure 1. Overview of the BiCICLe Framework. (Left) Bimanual demonstrations are serialized into textual sequences of state observations and actions to construct the in-context prompt. (Right) During inference, a leader-follower decomposition enforces inter-arm coordination: the Leader agent predicts its full trajectory first; the Follower agent then predicts its actions conditioned on the Leader's plan. No task-spec…
Figure 2
Figure 2. Bimanual action prediction architectures.
Figure 3
Figure 3. New generalization tasks. Two bimanual tasks designed outside the TWIN benchmark. (Top) Close Jar. (Bottom) Take Item Out of Box. Generalization to New Tasks. A key advantage of ICL over supervised methods is the ability to generalize to novel tasks without retraining, as providing a few demonstrations at test time is sufficient. To evaluate this, we design two new bimanual tasks…
Figure 4
Figure 4. Qualitative comparison. Each pair of rows contrasts a successful BiCICLe episode (✓) with a failed baseline episode (✗) on the same task. (Top two rows) Lift Ball (tightly coupled symmetric): BiCICLe vs. RoboPrompt-SA. (Bottom two rows) Tray Oven (loosely coupled): BiCICLe vs. RoboPrompt-DA. Columns show four keyframes sampled along the trajectory (left to right: initial approach, contact, manipulation, o…
Figure 5
Figure 5. Real-world task executions. Top row: Lift Box task. Bottom row: Open Pot task. Both tasks are completed successfully by BiCICLe deployed on a physical bimanual Franka Panda system. These results demonstrate that our approach can be deployed on real world systems capable of executing physically demanding bimanual tasks, including those requiring tight inter-arm coordination and fine manipulation, with a low…
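Figure 1's left panel describes serializing bimanual demonstrations into textual state-action sequences for the in-context prompt. A minimal sketch of what such a serializer might look like, with field names and the per-arm action layout assumed purely for illustration:

```python
# Sketch of the demonstration serialization Figure 1 depicts: each
# timestep's observation and both arms' actions become one text line.
# The field names and [x, y, z, gripper] layout are assumptions.

def serialize_step(t: int, obs: dict, action: dict) -> str:
    # One timestep: observed object poses plus the two arms' actions.
    objects = ", ".join(f"{name}: {pose}" for name, pose in obs.items())
    return (
        f"t={t} | observe [{objects}] | "
        f"right -> {action['right']} | left -> {action['left']}"
    )

def serialize_demo(demo: list[tuple[dict, dict]]) -> str:
    return "\n".join(
        serialize_step(t, obs, act) for t, (obs, act) in enumerate(demo)
    )

# Hypothetical two-step lifting demo:
demo = [
    ({"ball": [50, 50, 10]}, {"right": [60, 50, 10, 1], "left": [40, 50, 10, 1]}),
    ({"ball": [50, 50, 30]}, {"right": [60, 50, 30, 0], "left": [40, 50, 30, 0]}),
]
print(serialize_demo(demo))
```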
read the original abstract

Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. This naturally extends to Arms' Debate, an iterative refinement process, and to the introduction of a third LLM-as-Judge to evaluate and select the most plausible coordinated trajectories. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods. We further demonstrate strong few-shot generalization on novel tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BiCICLe (Bimanual Coordinated In-Context Learning), a framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning by framing it as a multi-agent leader-follower problem with sequential conditioned single-arm predictions, an iterative Arms' Debate refinement process, and an LLM-as-Judge for trajectory selection. Evaluated on 13 tasks from the TWIN benchmark, it claims up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods, while demonstrating strong few-shot generalization on novel tasks.

Significance. If the results hold, this work would be significant for the field of robot learning and embodied AI, as it provides a training-free approach to complex bimanual tasks using off-the-shelf LLMs, potentially improving generalization and reducing data requirements compared to supervised methods. The multi-agent debate mechanism is a novel way to handle coordination in high-dimensional action spaces.

major comments (2)
  1. [Experimental Evaluation] The abstract reports concrete benchmark numbers (71.1% success rate, 6.7 pp improvement) but provides no details on exact baselines, task definitions, statistical significance, number of trials, or failure modes. The full experimental section is required to assess whether results support the central claim that the framework sufficiently captures inter-arm coordination.
  2. [Method (BiCICLe framework)] The claim that decoupling bimanual control into sequential leader-follower single-arm predictions plus iterative LLM debate captures tight coordination constraints is load-bearing but not sufficiently justified. For tasks with high simultaneity (e.g., object handoff or synchronized lifting), the sequential nature and text-based conditioning may lose critical joint-state information, potentially making reported gains artifactual.
minor comments (2)
  1. [Abstract] The abstract reports 'up to 71.1%'; it should state explicitly whether this figure is the average across tasks or the peak of the best configuration.
  2. [Notation] The terms 'Arms' Debate' and 'LLM-as-Judge' are introduced without prior definition in the abstract; ensure they are clearly defined early in the paper.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the work's potential significance. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Experimental Evaluation] The abstract reports concrete benchmark numbers (71.1% success rate, 6.7 pp improvement) but provides no details on exact baselines, task definitions, statistical significance, number of trials, or failure modes. The full experimental section is required to assess whether results support the central claim that the framework sufficiently captures inter-arm coordination.

    Authors: We agree that abstracts are concise by nature and that full details are essential for evaluating the coordination claim. The manuscript's Experimental Evaluation section (Section 4) specifies the exact baselines (with the strongest training-free baseline at 64.4%), the 13 TWIN benchmark task definitions, 10 independent trials per task for computing success rates and variability, and qualitative analysis of failure modes including coordination failures. To improve accessibility, we will add a short reference in the abstract to these experimental details and expand the failure mode discussion with a summary table. This revision ensures the results' support for inter-arm coordination is fully transparent. revision: partial

  2. Referee: [Method (BiCICLe framework)] The claim that decoupling bimanual control into sequential leader-follower single-arm predictions plus iterative LLM debate captures tight coordination constraints is load-bearing but not sufficiently justified. For tasks with high simultaneity (e.g., object handoff or synchronized lifting), the sequential nature and text-based conditioning may lose critical joint-state information, potentially making reported gains artifactual.

    Authors: We acknowledge the importance of rigorously justifying the leader-follower decoupling and debate mechanism. The Method section details how the follower arm conditions its prediction directly on the leader's generated action sequence and shared textual state, while Arms' Debate performs iterative critique and refinement to resolve inter-arm dependencies before the LLM judge selects the trajectory. For high-simultaneity tasks, the text-based conditioning and multi-turn debate allow the LLM to reason about timing and joint constraints. To further substantiate this and rule out artifactual gains, we will add a new subsection with ablations on debate iterations for simultaneous tasks (e.g., handoffs) and qualitative trajectory examples showing how coordination is recovered. This addresses the concern directly. revision: yes
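As a rough sketch, under assumptions, of the debate-and-judge loop this response describes: iterative critique and refinement of both arms' plans, then selection among candidate coordinated trajectories. The callable and all prompts below are illustrative, not the paper's actual templates.

```python
# Sketch of Arms' Debate plus LLM-as-Judge. All prompts are illustrative.

from typing import Callable

LLM = Callable[[str], str]  # any off-the-shelf, text-only model

def arms_debate(llm: LLM, context: str, leader: str, follower: str,
                rounds: int = 2) -> list[tuple[str, str]]:
    # Iteratively critique and revise both plans, keeping each round's
    # result as a candidate coordinated trajectory.
    candidates = [(leader, follower)]
    for _ in range(rounds):
        critique = llm(
            f"{context}\nLeader plan: {leader}\nFollower plan: {follower}\n"
            "Critique the timing and inter-arm coordination of these plans:"
        )
        leader = llm(f"{context}\nCritique: {critique}\nRevise the leader plan:")
        follower = llm(
            f"{context}\nRevised leader plan: {leader}\nCritique: {critique}\n"
            "Revise the follower plan to stay coordinated with the leader:"
        )
        candidates.append((leader, follower))
    return candidates

def judge(llm: LLM, context: str,
          candidates: list[tuple[str, str]]) -> tuple[str, str]:
    # A third LLM scores the candidates and picks one by index.
    listing = "\n".join(
        f"[{i}] leader: {l} | follower: {f}"
        for i, (l, f) in enumerate(candidates)
    )
    choice = llm(
        f"{context}\nCandidate coordinated trajectories:\n{listing}\n"
        "Reply with only the index of the most plausible candidate:"
    )
    try:
        index = int(choice.strip())
    except ValueError:
        index = 0  # fall back to the initial plan on a malformed reply
    return candidates[max(0, min(index, len(candidates) - 1))]
```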

Circularity Check

0 steps flagged

No circularity: the framework is a prompting structure evaluated on an external benchmark

full rationale

The paper presents BiCICLe as a new multi-agent prompting framework that decouples bimanual actions into sequential leader-follower predictions plus debate and judging, then reports success rates on the external TWIN benchmark. No equations, fitted parameters, or derivations are present that reduce to the inputs by construction. The central claim is an empirical evaluation of a novel structure rather than a self-referential mathematical result, and no load-bearing self-citations or ansatzes are invoked to justify the core method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the untested premise that standard LLMs possess sufficient in-context coordination reasoning when actions are decoupled sequentially.

axioms (1)
  • domain assumption: Standard text-only LLMs can perform reliable sequential action prediction and iterative refinement for physical robot coordination when given few-shot examples.
    Invoked throughout the description of BiCICLe, Arms' Debate, and LLM-as-Judge.
invented entities (3)
  • BiCICLe framework · no independent evidence
    purpose: decouple the bimanual action space into leader-follower single-arm predictions
    Core new structure introduced to fit ICL to bimanual control.
  • Arms' Debate · no independent evidence
    purpose: iterative refinement of coordinated trajectories
    New process added to improve plausibility of multi-arm plans.
  • LLM-as-Judge · no independent evidence
    purpose: evaluate and select the most plausible coordinated trajectories
    Third LLM component introduced to break ties and improve selection.

pith-pipeline@v0.9.0 · 5542 in / 1316 out tokens · 54263 ms · 2026-05-10T00:22:38.743296+00:00 · methodology

