pith. sign in

arxiv: 2606.13279 · v1 · pith:ODQZZKKRnew · submitted 2026-06-11 · 💻 cs.RO

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Pith reviewed 2026-06-27 06:44 UTC · model grok-4.3

classification 💻 cs.RO
keywords bimanual manipulationvision-language-actionmixture of expertsvisual routerstructural decompositionrobot learningdual-level architecture
0
0 comments X

The pith

Dual-level decomposition of visual selection and bimanual modes raises manipulation success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard Vision-Language-Action policies for bimanual robots use a single shared pathway for all visual inputs and action patterns, which fails to handle task-stage changes in visual relevance or shifts between independent and coordinated arm behavior. It introduces Dual-Level Structural Decomposition: a View-Selective Visual Router that dynamically weights wrist-camera contributions and an Interaction-Aware Action Mixture-of-Experts that routes action generation through coordinated or arm-specific experts. Experiments on six simulated tasks and three real long-horizon tasks show the combined decomposition lifts average success rates by 27.7 percent in simulation and 43.3 percent in the real world over a monolithic baseline.

Core claim

Jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

What carries the argument

Dual-Level Structural Decomposition, implemented as a View-Selective Visual Router that adjusts wrist-view contributions and an Interaction-Aware Action Mixture-of-Experts that splits action generation into coordinated and arm-wise pathways.

If this is right

  • The combined router and MoE yields a 27.7 percent higher average success rate across six simulated bimanual tasks.
  • The same architecture produces a 43.3 percent higher average success rate on three long-horizon real-world tasks.
  • Single-module ablations that remove either the visual router or the action MoE each underperform the full dual-level model.
  • The router learns to emphasize task-relevant wrist views at different stages while the MoE adapts to independent versus coordinated arm modes.
  • Performance gains appear consistently in both simulation and real-world settings when both decomposition levels are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of visual gating from action routing could be tested on other multi-limb or multi-robot coordination problems where mode switching occurs.
  • If the router and MoE are kept lightweight, the approach may reduce the data volume needed to reach high success on new bimanual tasks.
  • A natural extension would measure whether the learned routing decisions align with human intuition about when each arm should act independently.

Load-bearing premise

Visual relevance and bimanual interaction modes vary in ways that a learnable router and mixture-of-experts can capture without losing critical cross-modal information or causing training instability.

What would settle it

Run the full model against the monolithic baseline on the same six simulated tasks; if average success rate improvement falls below 10 percent while training remains stable, the claimed inductive bias does not hold.

Figures

Figures reproduced from arXiv: 2606.13279 by Soo-Chul Lim, Yoon-Ji Choi, Young-Chae Son.

Figure 1
Figure 1. Figure 1: Dual-level structural decomposition framework. (a) Task-relevant views change as inter [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall architecture. (a) The View-Selective Visual Router is inserted before the VLM [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Real-world task structure and cumulative stage-wise success rates. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Simulation task overview and success rates under easy and hard settings. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Robustness evaluation results. (a) Average simulation success rates under Easy/Hard [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Detailed architecture of the proposed dual-level routing framework. (a) View-Selective [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Simulation task configurations for S1–S4. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Simulation task configurations for S5–S6. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Real-world objects used in the easy and hard conditions for R1–R3. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Router outputs across S1–S6. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
read the original abstract

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures via a Dual-Level Structural Decomposition framework (View-Selective Visual Router plus Interaction-Aware Action MoE) supplies an effective inductive bias for robust bimanual manipulation. This is supported by reported average success-rate gains of 27.7% in simulation (six RoboTwin 2.0 tasks) and 43.3% in real-world evaluation (three long-horizon tasks) over a monolithic VLA baseline, with consistent outperformance of single-module variants.

Significance. If the empirical gains are shown to arise specifically from the claimed structural decomposition rather than capacity or tuning effects, the work would supply a concrete inductive bias for handling stage-dependent visual relevance and shifting independent/coordinated bimanual modes, which is a recurring challenge in VLA policy learning for manipulation.

major comments (2)
  1. [Evaluation] Abstract and Evaluation section: the reported 27.7 % / 43.3 % gains are presented without error bars, statistical significance tests, or details on baseline architectures and data-split ablations; these omissions are load-bearing because the central claim rests on the assertion that the dual-level modules, rather than other factors, produce the observed improvements.
  2. [Methods] Methods section: the precise architectures of the View-Selective Visual Router and Interaction-Aware Action MoE, together with training dynamics and capacity-matched controls, are not verifiable from the supplied text, preventing confirmation that the inductive bias is the operative mechanism.
minor comments (1)
  1. The abstract would benefit from explicit naming of the monolithic baseline and the six simulated tasks to allow immediate assessment of task diversity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger statistical reporting and more verifiable methodological details. We address each major comment below and commit to revisions that directly strengthen the empirical claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Evaluation] Abstract and Evaluation section: the reported 27.7 % / 43.3 % gains are presented without error bars, statistical significance tests, or details on baseline architectures and data-split ablations; these omissions are load-bearing because the central claim rests on the assertion that the dual-level modules, rather than other factors, produce the observed improvements.

    Authors: We agree that the absence of error bars and significance testing weakens the load-bearing claim. In the revised manuscript we will report mean success rates with standard deviations computed over at least three random seeds for all methods, include paired t-test p-values between our model and the monolithic baseline, and expand Section 4 to detail the exact baseline architectures (including parameter counts and training protocols) as well as the train/validation/test splits used for the six RoboTwin 2.0 tasks. We will also add an explicit data-split ablation table. These additions will be placed in the main evaluation section rather than the appendix. revision: yes

  2. Referee: [Methods] Methods section: the precise architectures of the View-Selective Visual Router and Interaction-Aware Action MoE, together with training dynamics and capacity-matched controls, are not verifiable from the supplied text, preventing confirmation that the inductive bias is the operative mechanism.

    Authors: We acknowledge that the current text does not supply sufficient architectural specificity or capacity-matched controls. In revision we will insert (i) a detailed diagram and layer-by-layer specification of the View-Selective Visual Router (including the gating network, feature dimensions, and how wrist-view weights are computed from task-stage embeddings), (ii) the expert specialization, router architecture, and load-balancing loss for the Interaction-Aware Action MoE, (iii) a hyperparameter table and training curve appendix, and (iv) a new capacity-matched ablation in which the MoE is replaced by a single feed-forward network whose parameter count equals the sum of the experts. These changes will allow readers to verify that performance gains arise from the structural decomposition rather than capacity differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical inductive bias claim rests on held-out task evaluation

full rationale

The paper introduces a View-Selective Visual Router and Interaction-Aware Action MoE as architectural components for bimanual VLA policies. Its central claim—that this dual-level decomposition supplies an effective inductive bias—is supported solely by reported success-rate gains (27.7 % simulation, 43.3 % real-world) over a monolithic baseline and single-module ablations on six simulated and three real tasks. No equations, parameter-fitting steps, or self-citations appear in the provided text that would reduce these empirical outcomes to quantities defined inside the model itself. The evaluation uses held-out tasks, so the performance numbers are not forced by construction from the training data or internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on standard supervised or imitation-learning assumptions for VLA training plus the domain assumption that bimanual interaction modes are separable into coordinated versus arm-wise pathways. No free parameters beyond ordinary network weights are named. Two new architectural components are introduced without external falsifiable handles.

axioms (1)
  • domain assumption Existing monolithic VLA policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway
    Stated directly in the abstract as the motivation for the new framework.
invented entities (2)
  • View-Selective Visual Router no independent evidence
    purpose: Dynamically adjusts wrist-view contributions to emphasize relevant visual cues
    New module introduced to handle varying visual relevance; no independent evidence outside the proposed framework.
  • Interaction-Aware Action Mixture-of-Experts no independent evidence
    purpose: Decomposes action generation into coordinated and arm-wise pathways
    New module introduced to adapt to varying bimanual interaction modes; no independent evidence outside the proposed framework.

pith-pipeline@v0.9.1-grok · 5769 in / 1469 out tokens · 19637 ms · 2026-06-27T06:44:33.011073+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 11 canonical work pages

  1. [1]

    Krebs and T

    F. Krebs and T. Asfour. A bimanual manipulation taxonomy.IEEE Robotics and Automation Letters, 7(4):11031–11038, 2022. doi:10.1109/LRA.2022.3196158

  2. [2]

    Smith, Y

    C. Smith, Y . Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V . Dimarogonas, and D. Kragic. Dual arm manipulation—a survey.Robotics and Autonomous Systems, 60(10):1340–1353,

  3. [3]

    doi:https://doi.org/10.1016/j.robot.2012.07.005

    ISSN 0921-8890. doi:https://doi.org/10.1016/j.robot.2012.07.005

  4. [4]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, 2023. doi: 10.15607/RSS.2023.XIX.016

  5. [5]

    Z. Fu, T. Z. Zhao, and C. Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), volume 270, pages 4066–4083. PMLR, 2024

  6. [6]

    Grotz, M

    M. Grotz, M. Shridhar, Y .-W. Chao, T. Asfour, and D. Fox. Peract2: Benchmarking and learning for robotic bimanual manipulation tasks. InCoRL Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, 2024

  7. [7]

    Y . Li, C. Pan, H. Xu, X. Wang, and Y . Wu. Efficient bimanual handover and rearrangement via symmetry-aware actor-critic learning. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3867–3874, 2023. doi:10.1109/ICRA48891.2023.10160739

  8. [8]

    A. Tung, J. Wong, A. Mandlekar, R. Mart´ın-Mart´ın, Y . Zhu, L. Fei-Fei, and S. Savarese. Learn- ing multi-arm manipulation through collaborative teleoperation. InIEEE International Confer- ence on Robotics and Automation (ICRA), pages 9212–9219, 2021. doi:10.1109/ICRA48506. 2021.9561491

  9. [9]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Fos- ter, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), volume 270, pages 2679–2713. PMLR, 2025

  10. [10]

    M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  11. [11]

    and Kragic, D

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025. doi:10.15607/RSS.2025.XXI.010

  12. [12]

    Black, N

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world general- ization. InConference on Robot Learning (CoRL), volume 305, pages 17–40. PMLR, 2025

  13. [13]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. InRobotics: Science and Systems (RSS), 2023. doi:10.15607/RSS.2023.XIX.025

  14. [14]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), volume 229, pages 2165–2183. PMLR, 2023

  15. [15]

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An Open-Source Generalist Robot Policy. InRobotics: Science and Systems (RSS), 2024. doi:10.15607/RSS.2024.XX.090. 10

  16. [16]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. RDT-1b: a diffu- sion foundation model for bimanual manipulation. InInternational Conference on Learning Representations (ICLR), 2025

  17. [17]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  18. [18]

    Sferrazza, Y

    F. Krebs and T. Asfour. Formalization of temporal and spatial constraints of bimanual manipu- lation categories. InInternational Conference on Intelligent Robots and Systems (IROS), pages 1302–1309, 2024. doi:10.1109/IROS58592.2024.10801861

  19. [19]

    H. Deng, W. Guo, Q. Wang, Z. Wu, and Z. Wang. Safebimanual: Diffusion-based trajec- tory optimization for safe bimanual manipulation. InConference on Robot Learning (CoRL), volume 305, pages 3218–3238. PMLR, 2025

  20. [20]

    I.-C. A. Liu, J. Chen, G. S. Sukhatme, and D. Seita. D-coda: Diffusion for coordinated dual- arm data augmentation. InConference on Robot Learning (CoRL), volume 305, pages 3569–

  21. [21]

    Z. Lan, W. Mao, H. Li, L. Wang, T. Wang, H. Fan, and O. Yoshie. Bfa: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10 (9):8930–8937, 2025. doi:10.1109/LRA.2025.3589149

  22. [22]

    Y . Bai, Z. Wang, Y . Liu, K. Luo, Y . Wen, M. Dai, W. Chen, Z. Chen, L. Liu, G. Li, and L. Lin. Learning to see and act: Task-aware virtual view exploration for robotic manipulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  23. [23]

    Chuang, A

    I. Chuang, A. Lee, D. Gao, M.-M. Naddaf-Sh, and I. Soltani. Active vision might be all you need: Exploring active vision in bimanual robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), pages 7952–7959, 2025

  24. [24]

    R. Feng, D. Hu, W. Ma, and X. Li. Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation. InConference on Robot Learning (CoRL), volume 270, pages 340–363. PMLR, 2025

  25. [25]

    Son, J.-W

    Y .-C. Son, J.-W. Lee, Y .-J. Choi, D.-K. Ko, and S.-C. Lim. Selective perception for robot: Task-aware attention in multimodal vla.arXiv preprint arXiv:2602.15543, 2026

  26. [26]

    X. Zhai, Z. Huang, L. Wu, Q. Zhao, Q. Yu, J. Ren, C. Hao, and H. Soh. Skillvla: Tackling com- binatorial diversity in dual-arm manipulation via skill reuse.arXiv preprint arXiv:2603.03836, 2026

  27. [27]

    Jiang, X.-M

    J.-J. Jiang, X.-M. Wu, Y .-X. He, L.-A. Zeng, Y .-L. Wei, D. Zhang, and W.-S. Zheng. Re- thinking bimanual robotic manipulation: Learning with decoupled interaction framework. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 12427–12437, 2025

  28. [28]

    A. C.-W. Lee, I. Chuang, L.-Y . Chen, and I. Soltani. Interact: Inter-dependency aware action chunking with hierarchical attention transformers for bimanual manipulation. InConference on Robot Learning (CoRL), volume 270, pages 1730–1743. PMLR, 2025

  29. [29]

    H. Xu, L. Lei, J. Gu, C. Tang, J. Chen, and R. Wang. Move-then-operate: Behavioral phasing for human-like robotic manipulation.arXiv preprint arXiv:2604.23620, 2026

  30. [30]

    J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi. Modeling task relationships in multi- task learning with multi-gate mixture-of-experts. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, page 1930–1939. Associa- tion for Computing Machinery, 2018. ISBN 9781450355520. doi:10.1145/3219819.3220007. 11

  31. [31]

    Xiong, X

    H. Xiong, X. Xu, J. Wu, Y . Hou, J. Bohg, and S. Song. Vision in action: Learning active perception from human demonstrations. InConference on Robot Learning (CoRL), pages 5450–5463. PMLR, 2025

  32. [32]

    I.-C. A. Liu, S. He, D. Seita, and G. S. Sukhatme. V oxact-b: V oxel-based acting and stabilizing policy for bimanual manipulation. InConference on Robot Learning (CoRL), volume 270, pages 4354–4370. PMLR, 2025

  33. [33]

    H. Im, E. Jeong, A. Kolobov, J. Fu, and Y . Lee. Twinvla: Data-efficient bimanual manipulation with twin single-arm vision-language-action models.arXiv preprint arXiv:2511.05275, 2025

  34. [34]

    G. Lu, T. Yu, H. Deng, S. S. Chen, Y . Tang, and Z. Wang. Anybimanual: Transferring uni- manual policy for general bimanual manipulation. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 13662–13672, 2025

  35. [35]

    Z. Yang, Y . Chai, X. Jia, Q. Li, Y . Shao, X. Zhu, H. Su, and J. Yan. Drivemoe: Mixture-of- experts for vision-language-action model in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10678–10688, 2026

  36. [36]

    Grannen, Y

    J. Grannen, Y . Wu, B. Vu, and D. Sadigh. Stabilize to act: Learning to coordinate for bimanual manipulation. InConference on Robot Learning (CoRL), volume 229, pages 563–576. PMLR, 2023

  37. [37]

    Lipman, R

    Y . Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

  38. [38]

    A. Tong, K. FATRAS, N. Malkin, G. Huguet, Y . Zhang, J. Rector-Brooks, G. Wolf, and Y . Ben- gio. Improving and generalizing flow-based generative models with minibatch optimal trans- port.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

  39. [39]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations (ICLR), 2022

  40. [40]

    E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. InInter- national Conference on Learning Representations (ICLR), 2017

  41. [41]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 12 A Architecture Implementation Details This section provides implementation details omitted from ...