See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Soo-Chul Lim; Yoon-Ji Choi; Young-Chae Son

arxiv: 2606.13279 · v1 · pith:ODQZZKKRnew · submitted 2026-06-11 · 💻 cs.RO

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Yoon-Ji Choi , Young-Chae Son , Soo-Chul Lim This is my paper

Pith reviewed 2026-06-27 06:44 UTC · model grok-4.3

classification 💻 cs.RO

keywords bimanual manipulationvision-language-actionmixture of expertsvisual routerstructural decompositionrobot learningdual-level architecture

0 comments

The pith

Dual-level decomposition of visual selection and bimanual modes raises manipulation success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard Vision-Language-Action policies for bimanual robots use a single shared pathway for all visual inputs and action patterns, which fails to handle task-stage changes in visual relevance or shifts between independent and coordinated arm behavior. It introduces Dual-Level Structural Decomposition: a View-Selective Visual Router that dynamically weights wrist-camera contributions and an Interaction-Aware Action Mixture-of-Experts that routes action generation through coordinated or arm-specific experts. Experiments on six simulated tasks and three real long-horizon tasks show the combined decomposition lifts average success rates by 27.7 percent in simulation and 43.3 percent in the real world over a monolithic baseline.

Core claim

Jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

What carries the argument

Dual-Level Structural Decomposition, implemented as a View-Selective Visual Router that adjusts wrist-view contributions and an Interaction-Aware Action Mixture-of-Experts that splits action generation into coordinated and arm-wise pathways.

If this is right

The combined router and MoE yields a 27.7 percent higher average success rate across six simulated bimanual tasks.
The same architecture produces a 43.3 percent higher average success rate on three long-horizon real-world tasks.
Single-module ablations that remove either the visual router or the action MoE each underperform the full dual-level model.
The router learns to emphasize task-relevant wrist views at different stages while the MoE adapts to independent versus coordinated arm modes.
Performance gains appear consistently in both simulation and real-world settings when both decomposition levels are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of visual gating from action routing could be tested on other multi-limb or multi-robot coordination problems where mode switching occurs.
If the router and MoE are kept lightweight, the approach may reduce the data volume needed to reach high success on new bimanual tasks.
A natural extension would measure whether the learned routing decisions align with human intuition about when each arm should act independently.

Load-bearing premise

Visual relevance and bimanual interaction modes vary in ways that a learnable router and mixture-of-experts can capture without losing critical cross-modal information or causing training instability.

What would settle it

Run the full model against the monolithic baseline on the same six simulated tasks; if average success rate improvement falls below 10 percent while training remains stable, the claimed inductive bias does not hold.

Figures

Figures reproduced from arXiv: 2606.13279 by Soo-Chul Lim, Yoon-Ji Choi, Young-Chae Son.

**Figure 2.** Figure 2: Overall architecture. (a) The View-Selective Visual Router is inserted before the VLM [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Real-world task structure and cumulative stage-wise success rates. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Simulation task overview and success rates under easy and hard settings. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Robustness evaluation results. (a) Average simulation success rates under Easy/Hard [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Detailed architecture of the proposed dual-level routing framework. (a) View-Selective [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Simulation task configurations for S1–S4. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Simulation task configurations for S5–S6. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Real-world objects used in the easy and hard conditions for R1–R3. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Router outputs across S1–S6. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

read the original abstract

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Dual visual router plus action MoE gives reported gains on bimanual VLA tasks, but missing methods and stats block verification of where the gains come from.

read the letter

The colleague should know two things. The paper adds a View-Selective Visual Router that adjusts wrist-view weights and an Interaction-Aware Action MoE that splits action generation into coordinated and arm-wise paths inside a VLA policy. The abstract reports 27.7% higher average success in simulation and 43.3% higher in real-world bimanual tasks over a monolithic baseline, with consistent wins over single-module versions.

What is new is the explicit pairing of those two modules to match the stated problem of shifting visual relevance and changing bimanual interaction modes. The paper does a clear job naming why monolithic VLAs struggle with those variations and showing that the decomposed version improves results on six simulated tasks in RoboTwin 2.0 plus three long-horizon real tasks.

The soft spots are all in the evaluation. The abstract gives no error bars, no statistical tests, no capacity-matched baseline details, and no ablation on data splits or training stability. The full manuscript was not supplied, so the router and MoE architectures, training dynamics, and exact baseline implementations cannot be checked. This leaves open whether the gains trace to the claimed inductive bias or to extra parameters and tuning. The assumption that the added components preserve cross-modal information without instability is plausible but untested in the available text.

This is for robotics researchers building VLA policies for two-arm manipulation. A reader who wants a concrete architectural pattern to try on bimanual tasks will get something usable. It deserves a serious referee because the problem is real, the proposed fix is targeted, and the results span simulation and hardware, even if the current reporting needs strengthening.

I would send it to peer review with the expectation that the authors add the missing controls.

Referee Report

2 major / 1 minor

Summary. The paper claims that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures via a Dual-Level Structural Decomposition framework (View-Selective Visual Router plus Interaction-Aware Action MoE) supplies an effective inductive bias for robust bimanual manipulation. This is supported by reported average success-rate gains of 27.7% in simulation (six RoboTwin 2.0 tasks) and 43.3% in real-world evaluation (three long-horizon tasks) over a monolithic VLA baseline, with consistent outperformance of single-module variants.

Significance. If the empirical gains are shown to arise specifically from the claimed structural decomposition rather than capacity or tuning effects, the work would supply a concrete inductive bias for handling stage-dependent visual relevance and shifting independent/coordinated bimanual modes, which is a recurring challenge in VLA policy learning for manipulation.

major comments (2)

[Evaluation] Abstract and Evaluation section: the reported 27.7 % / 43.3 % gains are presented without error bars, statistical significance tests, or details on baseline architectures and data-split ablations; these omissions are load-bearing because the central claim rests on the assertion that the dual-level modules, rather than other factors, produce the observed improvements.
[Methods] Methods section: the precise architectures of the View-Selective Visual Router and Interaction-Aware Action MoE, together with training dynamics and capacity-matched controls, are not verifiable from the supplied text, preventing confirmation that the inductive bias is the operative mechanism.

minor comments (1)

The abstract would benefit from explicit naming of the monolithic baseline and the six simulated tasks to allow immediate assessment of task diversity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger statistical reporting and more verifiable methodological details. We address each major comment below and commit to revisions that directly strengthen the empirical claims without altering the core contributions.

read point-by-point responses

Referee: [Evaluation] Abstract and Evaluation section: the reported 27.7 % / 43.3 % gains are presented without error bars, statistical significance tests, or details on baseline architectures and data-split ablations; these omissions are load-bearing because the central claim rests on the assertion that the dual-level modules, rather than other factors, produce the observed improvements.

Authors: We agree that the absence of error bars and significance testing weakens the load-bearing claim. In the revised manuscript we will report mean success rates with standard deviations computed over at least three random seeds for all methods, include paired t-test p-values between our model and the monolithic baseline, and expand Section 4 to detail the exact baseline architectures (including parameter counts and training protocols) as well as the train/validation/test splits used for the six RoboTwin 2.0 tasks. We will also add an explicit data-split ablation table. These additions will be placed in the main evaluation section rather than the appendix. revision: yes
Referee: [Methods] Methods section: the precise architectures of the View-Selective Visual Router and Interaction-Aware Action MoE, together with training dynamics and capacity-matched controls, are not verifiable from the supplied text, preventing confirmation that the inductive bias is the operative mechanism.

Authors: We acknowledge that the current text does not supply sufficient architectural specificity or capacity-matched controls. In revision we will insert (i) a detailed diagram and layer-by-layer specification of the View-Selective Visual Router (including the gating network, feature dimensions, and how wrist-view weights are computed from task-stage embeddings), (ii) the expert specialization, router architecture, and load-balancing loss for the Interaction-Aware Action MoE, (iii) a hyperparameter table and training curve appendix, and (iv) a new capacity-matched ablation in which the MoE is replaced by a single feed-forward network whose parameter count equals the sum of the experts. These changes will allow readers to verify that performance gains arise from the structural decomposition rather than capacity differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical inductive bias claim rests on held-out task evaluation

full rationale

The paper introduces a View-Selective Visual Router and Interaction-Aware Action MoE as architectural components for bimanual VLA policies. Its central claim—that this dual-level decomposition supplies an effective inductive bias—is supported solely by reported success-rate gains (27.7 % simulation, 43.3 % real-world) over a monolithic baseline and single-module ablations on six simulated and three real tasks. No equations, parameter-fitting steps, or self-citations appear in the provided text that would reduce these empirical outcomes to quantities defined inside the model itself. The evaluation uses held-out tasks, so the performance numbers are not forced by construction from the training data or internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on standard supervised or imitation-learning assumptions for VLA training plus the domain assumption that bimanual interaction modes are separable into coordinated versus arm-wise pathways. No free parameters beyond ordinary network weights are named. Two new architectural components are introduced without external falsifiable handles.

axioms (1)

domain assumption Existing monolithic VLA policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway
Stated directly in the abstract as the motivation for the new framework.

invented entities (2)

View-Selective Visual Router no independent evidence
purpose: Dynamically adjusts wrist-view contributions to emphasize relevant visual cues
New module introduced to handle varying visual relevance; no independent evidence outside the proposed framework.
Interaction-Aware Action Mixture-of-Experts no independent evidence
purpose: Decomposes action generation into coordinated and arm-wise pathways
New module introduced to adapt to varying bimanual interaction modes; no independent evidence outside the proposed framework.

pith-pipeline@v0.9.1-grok · 5769 in / 1469 out tokens · 19637 ms · 2026-06-27T06:44:33.011073+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 11 canonical work pages

[1]

Krebs and T

F. Krebs and T. Asfour. A bimanual manipulation taxonomy.IEEE Robotics and Automation Letters, 7(4):11031–11038, 2022. doi:10.1109/LRA.2022.3196158

work page doi:10.1109/lra.2022.3196158 2022
[2]

Smith, Y

C. Smith, Y . Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V . Dimarogonas, and D. Kragic. Dual arm manipulation—a survey.Robotics and Autonomous Systems, 60(10):1340–1353,
[3]

doi:https://doi.org/10.1016/j.robot.2012.07.005

ISSN 0921-8890. doi:https://doi.org/10.1016/j.robot.2012.07.005

work page doi:10.1016/j.robot.2012.07.005 2012
[4]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, 2023. doi: 10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023
[5]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), volume 270, pages 4066–4083. PMLR, 2024

2024
[6]

Grotz, M

M. Grotz, M. Shridhar, Y .-W. Chao, T. Asfour, and D. Fox. Peract2: Benchmarking and learning for robotic bimanual manipulation tasks. InCoRL Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, 2024

2024
[7]

Y . Li, C. Pan, H. Xu, X. Wang, and Y . Wu. Efficient bimanual handover and rearrangement via symmetry-aware actor-critic learning. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3867–3874, 2023. doi:10.1109/ICRA48891.2023.10160739

work page doi:10.1109/icra48891.2023.10160739 2023
[8]

A. Tung, J. Wong, A. Mandlekar, R. Mart´ın-Mart´ın, Y . Zhu, L. Fei-Fei, and S. Savarese. Learn- ing multi-arm manipulation through collaborative teleoperation. InIEEE International Confer- ence on Robotics and Automation (ICRA), pages 9212–9219, 2021. doi:10.1109/ICRA48506. 2021.9561491

work page doi:10.1109/icra48506 2021
[9]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Fos- ter, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), volume 270, pages 2679–2713. PMLR, 2025

2025
[10]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025
[11]

and Kragic, D

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025. doi:10.15607/RSS.2025.XXI.010

work page doi:10.15607/rss.2025.xxi.010 2025
[12]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world general- ization. InConference on Robot Learning (CoRL), volume 305, pages 17–40. PMLR, 2025

2025
[13]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. InRobotics: Science and Systems (RSS), 2023. doi:10.15607/RSS.2023.XIX.025

work page doi:10.15607/rss.2023.xix.025 2023
[14]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), volume 229, pages 2165–2183. PMLR, 2023

2023
[15]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An Open-Source Generalist Robot Policy. InRobotics: Science and Systems (RSS), 2024. doi:10.15607/RSS.2024.XX.090. 10

work page doi:10.15607/rss.2024.xx.090 2024
[16]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. RDT-1b: a diffu- sion foundation model for bimanual manipulation. InInternational Conference on Learning Representations (ICLR), 2025

2025
[17]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[18]

Sferrazza, Y

F. Krebs and T. Asfour. Formalization of temporal and spatial constraints of bimanual manipu- lation categories. InInternational Conference on Intelligent Robots and Systems (IROS), pages 1302–1309, 2024. doi:10.1109/IROS58592.2024.10801861

work page doi:10.1109/iros58592.2024.10801861 2024
[19]

H. Deng, W. Guo, Q. Wang, Z. Wu, and Z. Wang. Safebimanual: Diffusion-based trajec- tory optimization for safe bimanual manipulation. InConference on Robot Learning (CoRL), volume 305, pages 3218–3238. PMLR, 2025

2025
[20]

I.-C. A. Liu, J. Chen, G. S. Sukhatme, and D. Seita. D-coda: Diffusion for coordinated dual- arm data augmentation. InConference on Robot Learning (CoRL), volume 305, pages 3569–
[21]

Z. Lan, W. Mao, H. Li, L. Wang, T. Wang, H. Fan, and O. Yoshie. Bfa: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10 (9):8930–8937, 2025. doi:10.1109/LRA.2025.3589149

work page doi:10.1109/lra.2025.3589149 2025
[22]

Y . Bai, Z. Wang, Y . Liu, K. Luo, Y . Wen, M. Dai, W. Chen, Z. Chen, L. Liu, G. Li, and L. Lin. Learning to see and act: Task-aware virtual view exploration for robotic manipulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[23]

Chuang, A

I. Chuang, A. Lee, D. Gao, M.-M. Naddaf-Sh, and I. Soltani. Active vision might be all you need: Exploring active vision in bimanual robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), pages 7952–7959, 2025

2025
[24]

R. Feng, D. Hu, W. Ma, and X. Li. Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation. InConference on Robot Learning (CoRL), volume 270, pages 340–363. PMLR, 2025

2025
[25]

Son, J.-W

Y .-C. Son, J.-W. Lee, Y .-J. Choi, D.-K. Ko, and S.-C. Lim. Selective perception for robot: Task-aware attention in multimodal vla.arXiv preprint arXiv:2602.15543, 2026

arXiv 2026
[26]

X. Zhai, Z. Huang, L. Wu, Q. Zhao, Q. Yu, J. Ren, C. Hao, and H. Soh. Skillvla: Tackling com- binatorial diversity in dual-arm manipulation via skill reuse.arXiv preprint arXiv:2603.03836, 2026

arXiv 2026
[27]

Jiang, X.-M

J.-J. Jiang, X.-M. Wu, Y .-X. He, L.-A. Zeng, Y .-L. Wei, D. Zhang, and W.-S. Zheng. Re- thinking bimanual robotic manipulation: Learning with decoupled interaction framework. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 12427–12437, 2025

2025
[28]

A. C.-W. Lee, I. Chuang, L.-Y . Chen, and I. Soltani. Interact: Inter-dependency aware action chunking with hierarchical attention transformers for bimanual manipulation. InConference on Robot Learning (CoRL), volume 270, pages 1730–1743. PMLR, 2025

2025
[29]

H. Xu, L. Lei, J. Gu, C. Tang, J. Chen, and R. Wang. Move-then-operate: Behavioral phasing for human-like robotic manipulation.arXiv preprint arXiv:2604.23620, 2026

Pith/arXiv arXiv 2026
[30]

J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi. Modeling task relationships in multi- task learning with multi-gate mixture-of-experts. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, page 1930–1939. Associa- tion for Computing Machinery, 2018. ISBN 9781450355520. doi:10.1145/3219819.3220007. 11

work page doi:10.1145/3219819.3220007 1930
[31]

Xiong, X

H. Xiong, X. Xu, J. Wu, Y . Hou, J. Bohg, and S. Song. Vision in action: Learning active perception from human demonstrations. InConference on Robot Learning (CoRL), pages 5450–5463. PMLR, 2025

2025
[32]

I.-C. A. Liu, S. He, D. Seita, and G. S. Sukhatme. V oxact-b: V oxel-based acting and stabilizing policy for bimanual manipulation. InConference on Robot Learning (CoRL), volume 270, pages 4354–4370. PMLR, 2025

2025
[33]

H. Im, E. Jeong, A. Kolobov, J. Fu, and Y . Lee. Twinvla: Data-efficient bimanual manipulation with twin single-arm vision-language-action models.arXiv preprint arXiv:2511.05275, 2025

arXiv 2025
[34]

G. Lu, T. Yu, H. Deng, S. S. Chen, Y . Tang, and Z. Wang. Anybimanual: Transferring uni- manual policy for general bimanual manipulation. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 13662–13672, 2025

2025
[35]

Z. Yang, Y . Chai, X. Jia, Q. Li, Y . Shao, X. Zhu, H. Su, and J. Yan. Drivemoe: Mixture-of- experts for vision-language-action model in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10678–10688, 2026

2026
[36]

Grannen, Y

J. Grannen, Y . Wu, B. Vu, and D. Sadigh. Stabilize to act: Learning to coordinate for bimanual manipulation. InConference on Robot Learning (CoRL), volume 229, pages 563–576. PMLR, 2023

2023
[37]

Lipman, R

Y . Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023
[38]

A. Tong, K. FATRAS, N. Malkin, G. Huguet, Y . Zhang, J. Rector-Brooks, G. Wolf, and Y . Ben- gio. Improving and generalizing flow-based generative models with minibatch optimal trans- port.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024
[39]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations (ICLR), 2022

2022
[40]

E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. InInter- national Conference on Learning Representations (ICLR), 2017

2017
[41]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 12 A Architecture Implementation Details This section provides implementation details omitted from ...

Pith/arXiv arXiv 2025

[1] [1]

Krebs and T

F. Krebs and T. Asfour. A bimanual manipulation taxonomy.IEEE Robotics and Automation Letters, 7(4):11031–11038, 2022. doi:10.1109/LRA.2022.3196158

work page doi:10.1109/lra.2022.3196158 2022

[2] [2]

Smith, Y

C. Smith, Y . Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V . Dimarogonas, and D. Kragic. Dual arm manipulation—a survey.Robotics and Autonomous Systems, 60(10):1340–1353,

[3] [3]

doi:https://doi.org/10.1016/j.robot.2012.07.005

ISSN 0921-8890. doi:https://doi.org/10.1016/j.robot.2012.07.005

work page doi:10.1016/j.robot.2012.07.005 2012

[4] [4]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, 2023. doi: 10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023

[5] [5]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), volume 270, pages 4066–4083. PMLR, 2024

2024

[6] [6]

Grotz, M

M. Grotz, M. Shridhar, Y .-W. Chao, T. Asfour, and D. Fox. Peract2: Benchmarking and learning for robotic bimanual manipulation tasks. InCoRL Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, 2024

2024

[7] [7]

Y . Li, C. Pan, H. Xu, X. Wang, and Y . Wu. Efficient bimanual handover and rearrangement via symmetry-aware actor-critic learning. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3867–3874, 2023. doi:10.1109/ICRA48891.2023.10160739

work page doi:10.1109/icra48891.2023.10160739 2023

[8] [8]

A. Tung, J. Wong, A. Mandlekar, R. Mart´ın-Mart´ın, Y . Zhu, L. Fei-Fei, and S. Savarese. Learn- ing multi-arm manipulation through collaborative teleoperation. InIEEE International Confer- ence on Robotics and Automation (ICRA), pages 9212–9219, 2021. doi:10.1109/ICRA48506. 2021.9561491

work page doi:10.1109/icra48506 2021

[9] [9]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Fos- ter, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), volume 270, pages 2679–2713. PMLR, 2025

2025

[10] [10]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025

[11] [11]

and Kragic, D

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025. doi:10.15607/RSS.2025.XXI.010

work page doi:10.15607/rss.2025.xxi.010 2025

[12] [12]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world general- ization. InConference on Robot Learning (CoRL), volume 305, pages 17–40. PMLR, 2025

2025

[13] [13]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. InRobotics: Science and Systems (RSS), 2023. doi:10.15607/RSS.2023.XIX.025

work page doi:10.15607/rss.2023.xix.025 2023

[14] [14]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), volume 229, pages 2165–2183. PMLR, 2023

2023

[15] [15]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An Open-Source Generalist Robot Policy. InRobotics: Science and Systems (RSS), 2024. doi:10.15607/RSS.2024.XX.090. 10

work page doi:10.15607/rss.2024.xx.090 2024

[16] [16]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. RDT-1b: a diffu- sion foundation model for bimanual manipulation. InInternational Conference on Learning Representations (ICLR), 2025

2025

[17] [17]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[18] [18]

Sferrazza, Y

F. Krebs and T. Asfour. Formalization of temporal and spatial constraints of bimanual manipu- lation categories. InInternational Conference on Intelligent Robots and Systems (IROS), pages 1302–1309, 2024. doi:10.1109/IROS58592.2024.10801861

work page doi:10.1109/iros58592.2024.10801861 2024

[19] [19]

H. Deng, W. Guo, Q. Wang, Z. Wu, and Z. Wang. Safebimanual: Diffusion-based trajec- tory optimization for safe bimanual manipulation. InConference on Robot Learning (CoRL), volume 305, pages 3218–3238. PMLR, 2025

2025

[20] [20]

I.-C. A. Liu, J. Chen, G. S. Sukhatme, and D. Seita. D-coda: Diffusion for coordinated dual- arm data augmentation. InConference on Robot Learning (CoRL), volume 305, pages 3569–

[21] [21]

Z. Lan, W. Mao, H. Li, L. Wang, T. Wang, H. Fan, and O. Yoshie. Bfa: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10 (9):8930–8937, 2025. doi:10.1109/LRA.2025.3589149

work page doi:10.1109/lra.2025.3589149 2025

[22] [22]

Y . Bai, Z. Wang, Y . Liu, K. Luo, Y . Wen, M. Dai, W. Chen, Z. Chen, L. Liu, G. Li, and L. Lin. Learning to see and act: Task-aware virtual view exploration for robotic manipulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[23] [23]

Chuang, A

I. Chuang, A. Lee, D. Gao, M.-M. Naddaf-Sh, and I. Soltani. Active vision might be all you need: Exploring active vision in bimanual robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), pages 7952–7959, 2025

2025

[24] [24]

R. Feng, D. Hu, W. Ma, and X. Li. Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation. InConference on Robot Learning (CoRL), volume 270, pages 340–363. PMLR, 2025

2025

[25] [25]

Son, J.-W

Y .-C. Son, J.-W. Lee, Y .-J. Choi, D.-K. Ko, and S.-C. Lim. Selective perception for robot: Task-aware attention in multimodal vla.arXiv preprint arXiv:2602.15543, 2026

arXiv 2026

[26] [26]

X. Zhai, Z. Huang, L. Wu, Q. Zhao, Q. Yu, J. Ren, C. Hao, and H. Soh. Skillvla: Tackling com- binatorial diversity in dual-arm manipulation via skill reuse.arXiv preprint arXiv:2603.03836, 2026

arXiv 2026

[27] [27]

Jiang, X.-M

J.-J. Jiang, X.-M. Wu, Y .-X. He, L.-A. Zeng, Y .-L. Wei, D. Zhang, and W.-S. Zheng. Re- thinking bimanual robotic manipulation: Learning with decoupled interaction framework. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 12427–12437, 2025

2025

[28] [28]

A. C.-W. Lee, I. Chuang, L.-Y . Chen, and I. Soltani. Interact: Inter-dependency aware action chunking with hierarchical attention transformers for bimanual manipulation. InConference on Robot Learning (CoRL), volume 270, pages 1730–1743. PMLR, 2025

2025

[29] [29]

H. Xu, L. Lei, J. Gu, C. Tang, J. Chen, and R. Wang. Move-then-operate: Behavioral phasing for human-like robotic manipulation.arXiv preprint arXiv:2604.23620, 2026

Pith/arXiv arXiv 2026

[30] [30]

J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi. Modeling task relationships in multi- task learning with multi-gate mixture-of-experts. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, page 1930–1939. Associa- tion for Computing Machinery, 2018. ISBN 9781450355520. doi:10.1145/3219819.3220007. 11

work page doi:10.1145/3219819.3220007 1930

[31] [31]

Xiong, X

H. Xiong, X. Xu, J. Wu, Y . Hou, J. Bohg, and S. Song. Vision in action: Learning active perception from human demonstrations. InConference on Robot Learning (CoRL), pages 5450–5463. PMLR, 2025

2025

[32] [32]

I.-C. A. Liu, S. He, D. Seita, and G. S. Sukhatme. V oxact-b: V oxel-based acting and stabilizing policy for bimanual manipulation. InConference on Robot Learning (CoRL), volume 270, pages 4354–4370. PMLR, 2025

2025

[33] [33]

H. Im, E. Jeong, A. Kolobov, J. Fu, and Y . Lee. Twinvla: Data-efficient bimanual manipulation with twin single-arm vision-language-action models.arXiv preprint arXiv:2511.05275, 2025

arXiv 2025

[34] [34]

G. Lu, T. Yu, H. Deng, S. S. Chen, Y . Tang, and Z. Wang. Anybimanual: Transferring uni- manual policy for general bimanual manipulation. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 13662–13672, 2025

2025

[35] [35]

Z. Yang, Y . Chai, X. Jia, Q. Li, Y . Shao, X. Zhu, H. Su, and J. Yan. Drivemoe: Mixture-of- experts for vision-language-action model in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10678–10688, 2026

2026

[36] [36]

Grannen, Y

J. Grannen, Y . Wu, B. Vu, and D. Sadigh. Stabilize to act: Learning to coordinate for bimanual manipulation. InConference on Robot Learning (CoRL), volume 229, pages 563–576. PMLR, 2023

2023

[37] [37]

Lipman, R

Y . Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023

[38] [38]

A. Tong, K. FATRAS, N. Malkin, G. Huguet, Y . Zhang, J. Rector-Brooks, G. Wolf, and Y . Ben- gio. Improving and generalizing flow-based generative models with minibatch optimal trans- port.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024

[39] [39]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations (ICLR), 2022

2022

[40] [40]

E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. InInter- national Conference on Learning Representations (ICLR), 2017

2017

[41] [41]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 12 A Architecture Implementation Details This section provides implementation details omitted from ...

Pith/arXiv arXiv 2025