arxiv: 2605.28486 · v1 · pith:M4ANGOUZnew · submitted 2026-05-27 · 💻 cs.RO

Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation

Pith reviewed 2026-06-29 11:57 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action model · bimanual manipulation · magnetic microrobots · action chunking transformer · teleoperated dataset · minimally invasive · dexterous control

The pith

Mag-VLA adapts a vision-language model to predict coordinated actions for two magnetic arms manipulating microrobots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mag-VLA to address the challenges of indirect magnetic actuation, limited sensing, and nonlinear interactions in microrobot control. It adapts a Qwen2.5-VL-7B backbone with LoRA to map visual observations and language instructions into actions for bimanual robotic arms. A motion-aware phase classifier combined with a phase-conditioned Action Chunking Transformer decoder produces temporally coherent trajectories that handle coupled control in a shared workspace. Training relies on a custom teleoperated dataset spanning three task configurations. Real-robot tests report 90 percent approach success across tasks and transport success of 80, 70, and 50 percent as difficulty increases.

Core claim

Mag-VLA adapts a vision-language backbone using Low-Rank Adaptation to process visual observations and language instructions, then employs a motion-aware phase classifier and phase-conditioned Action Chunking Transformer decoder to output coordinated multi-step trajectories for two magnetic actuators. This hierarchical structure enables bimanual capabilities such as microrobot reorientation. On a teleoperated dataset of three task configurations, the model achieves a 90 percent approach success rate in real-robot experiments and transport success rates of 80 percent, 70 percent, and 50 percent as task difficulty increases.
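
For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the general LoRA update in plain Python: a frozen weight matrix W is augmented with a trainable low-rank product, giving W + (alpha / r) * B * A. It illustrates the standard technique only. The rank, scaling factor, matrix sizes, and the choice of Qwen2.5-VL layers that Mag-VLA adapts are not stated in the abstract, so every number and function name here is a hypothetical placeholder.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A.

    w: frozen pretrained weight, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r
    """
    r = len(a)  # LoRA rank = number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Number of trainable parameters LoRA adds for one adapted matrix."""
    return r * (d_in + d_out)


# Toy 3x3 example with rank 1 (all values are illustrative, not from the paper).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 2.0, 3.0]]              # r x d_in
B_init = [[0.0], [0.0], [0.0]]     # d_out x r, zero at initialisation
B_trained = [[1.0], [0.0], [0.0]]  # pretend result of fine-tuning

w_start = lora_effective_weight(W, A, B_init, alpha=2.0)
w_tuned = lora_effective_weight(W, A, B_trained, alpha=2.0)
```

With B initialised to zero, as in the original LoRA formulation, the adapted layer starts out identical to the pretrained one. Only r * (d_in + d_out) parameters are trained per adapted matrix instead of d_in * d_out, which is what makes adapting a 7B-parameter backbone to a small teleoperated dataset tractable.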

What carries the argument

The phase-conditioned Action Chunking Transformer decoder that generates temporally coherent multi-step control actions conditioned on motion phase for bimanual coordination.
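
The control flow implied by action chunking can be sketched as follows: at each query a phase label is estimated, the decoder emits a chunk of several future actions for both arms, and the chunk is executed before the policy is queried again. This is a toy illustration of that loop, not the paper's decoder. The phase names, the chunk length, the scalar "distance" observation, and all function names are hypothetical stand-ins, and the abstract does not say how often Mag-VLA re-queries or whether it blends overlapping chunks.

```python
from typing import Callable, List, Tuple

Action = Tuple[float, float]  # (left-arm command, right-arm command); layout is assumed
Chunk = List[Action]


def run_chunked(
    classify_phase: Callable[[float], str],
    decode_chunk: Callable[[float, str, int], Chunk],
    step_env: Callable[[float, Action], float],
    obs: float,
    total_steps: int,
    chunk_len: int,
) -> List[Tuple[str, Action]]:
    """Run a chunked policy: one query yields chunk_len actions, executed in order."""
    log: List[Tuple[str, Action]] = []
    while len(log) < total_steps:
        phase = classify_phase(obs)                  # phase estimated once per query
        chunk = decode_chunk(obs, phase, chunk_len)  # multi-step, phase-conditioned output
        for action in chunk:
            if len(log) >= total_steps:
                break
            obs = step_env(obs, action)
            log.append((phase, action))
    return log


# --- Hypothetical stand-ins for the learned components and the robot ---
def toy_classify_phase(distance: float) -> str:
    return "approach" if distance > 1.0 else "transport"


def toy_decode_chunk(distance: float, phase: str, chunk_len: int) -> Chunk:
    step = 0.5 if phase == "approach" else 0.1  # move faster while approaching
    return [(step, step)] * chunk_len


def toy_step(distance: float, action: Action) -> float:
    left, right = action
    return max(0.0, distance - 0.5 * (left + right))


log = run_chunked(toy_classify_phase, toy_decode_chunk, toy_step,
                  obs=3.0, total_steps=8, chunk_len=4)
for phase, action in log:
    print(phase, action)
```

In this toy run the first chunk is decoded in the "approach" phase and the second in the "transport" phase, so the phase label changes only at chunk boundaries. Whether the real system behaves this way depends on details the abstract does not give.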

If this is right

  • Bimanual coordination enables microrobot reorientation that is difficult or infeasible with a single arm.
  • The ACT-based decoder substantially outperforms alternative generative action heads in ablation studies.
  • Hierarchical VLA modeling supplies a framework that learns task progression through phase classification.
  • The approach handles coupled control challenges arising from two actuators operating in one workspace.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase-conditioning approach might transfer to other indirect actuation methods if the classifier can be retrained on new sensor data.
  • Higher transport success on difficult tasks would likely require expanding the teleoperated dataset to include more varied magnetic nonlinearities.
  • Integration with real-time magnetic field sensing could reduce reliance on the assumption that training conditions match deployment conditions.

Load-bearing premise

The teleoperated dataset and real-robot test conditions sufficiently capture the nonlinear magnetic interactions and workspace constraints that occur in the intended minimally invasive applications.

What would settle it

A controlled test in which success rates fall below 50 percent when the model is evaluated on microrobot poses or magnetic field strengths outside the three training task configurations.

Figures

Figures reproduced from arXiv: 2605.28486 by Dandan Zhang, Kangyi Lu, Lan Wei, Yongchen Wang.

Figure 1: Experimental setup and end-to-end manipulation pipeline of Mag-VLA.
Figure 2: Overview of the proposed Mag-VLA framework. A history of four RGB observations and a language instruction are encoded by a LoRA-adapted …
Figure 3: Three language-conditioned manipulation tasks (A–C) of increasing …
Figure 4: Directional comparison of VLM backbones under the shared MLP …
Figure 6: Successful real-robot demonstrations on Tasks A, B, and C. The three …
Original abstract

Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Mag-VLA, a vision-language-action model for bimanual magnetically actuated microrobot manipulation. It adapts the Qwen2.5-VL-7B backbone with LoRA, introduces a motion-aware phase classifier and phase-conditioned Action Chunking Transformer (ACT) decoder, and trains on a new teleoperated dataset spanning three task configurations. Ablation studies compare the ACT decoder against alternative generative heads. Real-robot experiments report a 90% approach success rate across tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases.

Significance. If the reported success rates prove statistically robust and the teleoperated data generalizes beyond open-workspace conditions, the hierarchical VLA approach with bimanual coordination could offer a practical route to dexterous control of microrobots under indirect actuation. The explicit ablation of the phase-conditioned ACT decoder and the construction of a task-progression-aware dataset constitute concrete, reproducible contributions that future work in learned magnetic control can build upon.

major comments (2)
  1. [Abstract] Abstract: The central performance claims (90% approach success; 80/70/50% transport success) are stated without trial counts, standard deviations, confidence intervals, or any statistical test, making it impossible to evaluate whether the numbers support the generalization to minimally invasive settings.
  2. [Dataset Construction and Experiments] Dataset and real-robot experiments description: No quantitative metrics (e.g., field nonlinearity error, workspace overlap with tissue constraints, or sensing noise levels) are supplied to show that the teleoperated open-workspace data reproduces the coupled magnetic interactions and limited observability of the intended applications; this assumption is load-bearing for the application-level conclusions.
minor comments (1)
  1. [Abstract] The abstract introduces the motion-aware phase classifier and phase-conditioned ACT decoder without a one-sentence statement of how phase information is obtained at inference time.
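
To make the first major comment concrete, the sketch below computes Wilson score intervals for the reported success rates. The trial count is a hypothetical assumption (ten trials per condition); the abstract does not report how many trials were run, so the resulting intervals illustrate the referee's concern rather than describe the paper's actual uncertainty.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# HYPOTHETICAL trial count: not taken from the paper.
ASSUMED_TRIALS = 10
reported = [
    ("approach", 0.9),
    ("transport (easiest)", 0.8),
    ("transport (middle)", 0.7),
    ("transport (hardest)", 0.5),
]
for label, rate in reported:
    k = round(rate * ASSUMED_TRIALS)
    low, high = wilson_interval(k, ASSUMED_TRIALS)
    print(f"{label}: {k}/{ASSUMED_TRIALS} -> [{low:.2f}, {high:.2f}]")
```

Under that assumed count of ten trials, a 90 percent rate is compatible with a true success probability anywhere from roughly 60 to 98 percent, which is why the referee asks for the actual trial counts.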

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important aspects for improving the clarity and applicability of our results. We provide point-by-point responses below and indicate where revisions will be made.

  1. Referee: [Abstract] Abstract: The central performance claims (90% approach success; 80/70/50% transport success) are stated without trial counts, standard deviations, confidence intervals, or any statistical test, making it impossible to evaluate whether the numbers support the generalization to minimally invasive settings.

    Authors: We agree with this observation. The abstract currently presents aggregate success rates without accompanying statistical details. In the revised version, we will include the number of trials conducted for each task configuration, along with standard deviations and confidence intervals where applicable. This will allow readers to better assess the robustness of the reported performance. revision: yes

  2. Referee: [Dataset Construction and Experiments] Dataset and real-robot experiments description: No quantitative metrics (e.g., field nonlinearity error, workspace overlap with tissue constraints, or sensing noise levels) are supplied to show that the teleoperated open-workspace data reproduces the coupled magnetic interactions and limited observability of the intended applications; this assumption is load-bearing for the application-level conclusions.

    Authors: We acknowledge that our experiments are conducted in an open workspace and do not include direct quantitative comparisons to tissue-constrained environments, such as field nonlinearity errors or workspace overlaps with tissue. The teleoperated dataset captures the core challenges of bimanual magnetic actuation, including coupled interactions between the two arms and the microrobot. However, we recognize this as a limitation for claiming direct applicability to minimally invasive settings. In the revision, we will expand the discussion section to explicitly address the differences between open-workspace conditions and in vivo scenarios, and outline future work to bridge this gap. We believe the current results provide a valuable baseline for the VLA approach in magnetic microrobot control. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model training and real-robot evaluation on newly collected dataset

The paper introduces Mag-VLA by adapting an external Qwen2.5-VL backbone with LoRA, adding a motion-aware phase classifier and phase-conditioned ACT decoder, training on a teleoperated dataset constructed for this work, and reporting direct experimental success rates (90% approach, 80/70/50% transport) in real-robot tests. No equations, fitted parameters renamed as predictions, self-definitional relations, or load-bearing self-citations appear in the derivation or evaluation chain. Ablations compare decoder variants on the same data, but the central claims rest on independent experimental outcomes rather than reductions to inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities beyond the model architecture itself are described.

