pith. machine review for the scientific record.

arxiv: 2605.07306 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.AI

Recognition: no theorem link

BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:25 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied AI · VLA models · biological laboratory automation · multi-agent systems · protocol-driven robotics · vision-language-action · wet-lab manipulation · closed-loop control
0 comments

The pith

BioProVLA-Agent provides an affordable multi-agent robotic system that parses lab protocols, verifies states with vision and RAG, and executes actions via augmented VLA policies in closed loops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Biological labs require reproducible multi-step manipulation amid unstructured protocols, transparent or reflective labware, and variable lighting, yet most robotic solutions demand costly hardware or fixed setups. This paper introduces BioProVLA-Agent as a protocol-centered alternative that decomposes tasks through an LLM agent, checks readiness and completion with a VLM-RAG verification agent drawing on observations and examples, and performs actions with a lightweight VLA policy. An online augmentation method called AugSmolVLA specifically targets visual perturbations typical of wet labs. The system is tested on a benchmark of atomic tasks, composite workflows, and bimanual actions. If the approach holds, it lowers the barrier to automated biological work by letting researchers supply natural-language protocols rather than custom robot code.
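
Read as a control loop, the pith describes: parse once, then verify-execute-verify per subtask. A minimal sketch of that loop in Python follows, with every agent interface stubbed out as a hypothetical stand-in; the abstract does not specify the system's actual APIs.

```python
# Sketch of the closed-loop, protocol-driven execution described above.
# Every name here is a hypothetical stand-in, not the paper's API; the three
# agents are stubbed so that only the control flow is illustrated.
from dataclasses import dataclass

@dataclass
class Subtask:
    instruction: str    # natural-language execution instruction
    precondition: str   # state that must hold before acting
    postcondition: str  # state that signals completion
    knowledge: str      # retrieved protocol knowledge used for verification

def parse_protocol(protocol_text: str) -> list[Subtask]:
    """Stand-in for the Tailored LLM Protocol Agent: protocol -> subtasks."""
    raise NotImplementedError

def verify(observation, robot_state, condition: str, knowledge: str) -> bool:
    """Stand-in for the VLM-RAG Verification Agent's binary judgment."""
    raise NotImplementedError

def execute(instruction: str) -> None:
    """Stand-in for the VLA Embodied Agent (e.g., an AugSmolVLA policy)."""
    raise NotImplementedError

def run_protocol(protocol_text: str, observe, max_attempts: int = 3) -> None:
    for task in parse_protocol(protocol_text):
        for _ in range(max_attempts):
            obs, state = observe()
            if not verify(obs, state, task.precondition, task.knowledge):
                continue  # environment not ready; re-observe and retry
            execute(task.instruction)
            obs, state = observe()
            if verify(obs, state, task.postcondition, task.knowledge):
                break  # subtask verified complete; move to the next one
        else:
            # Persistent failure: the paper's Guiding Decision Agent would
            # re-execute, adjust the sequence, or ask for human intervention.
            raise RuntimeError(f"subtask failed verification: {task.instruction}")
```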

Core claim

BioProVLA-Agent integrates a Tailored LLM Protocol Agent that turns protocols into verifiable subtasks, a VLM-RAG Verification Agent that judges readiness and completion from robot states and retrieved knowledge, and a VLA Embodied Agent whose AugSmolVLA policy uses online augmentation to maintain stability in the presence of transparent objects, reflections, and illumination shifts. Evaluations across normal and high-exposure conditions show higher execution stability than ACT, X-VLA, and baseline SmolVLA on precise placement, composite sequences, and visually degraded scenes.

What carries the argument

The three-agent architecture (LLM Protocol Agent, VLM-RAG Verification Agent, VLA Embodied Agent) with AugSmolVLA online augmentation that targets transparent labware and lighting changes to support closed-loop protocol execution.
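
The augmentation half of this machinery has a concrete three-stage curriculum. So far as it can be recovered from the paper's figure-adjacent text, the observation fed to the policy at training stage $e$ is:

```latex
O_t^{(e)} =
\begin{cases}
O_t, & e \in \mathcal{E}_1, \\
T_{\alpha(e)}(O_t), & e \in \mathcal{E}_2, \\
\lambda\, O_t + (1 - \lambda)\, T_{\alpha(e)}(O_t), & e \in \mathcal{E}_3,
\end{cases}
```

where $T_{\alpha(e)}$ appears to be a stage-dependent visual perturbation, $\lambda$ a mixing weight, and $\mathcal{E}_1$, $\mathcal{E}_2$, $\mathcal{E}_3$ the original-data learning, augmented-data adaptation, and mixed-data consolidation stages, respectively.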

If this is right

  • Protocols become the direct interface for specifying tasks instead of low-level robot commands.
  • Execution stability increases for precise placement and handling of transparent objects in both normal and overexposed lighting.
  • The system supports hierarchical benchmarks covering atomic manipulations, composite workflows, and bimanual actions such as tube loading, sorting, and liquid pouring.
  • Closed-loop verification reduces reliance on one-shot instruction following in state-dependent procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same protocol-plus-verification pattern could apply to other protocol-heavy domains like chemical synthesis or sample preparation.
  • Adding real-time sensor fusion beyond vision might further reduce errors when lab conditions change mid-task.
  • The augmentation strategy for visual robustness could be tested on additional VLA backbones to measure transfer.

Load-bearing premise

The VLM-RAG Verification Agent can reliably judge task readiness and completion from observations, robot states, and examples in unstructured wet-lab conditions without frequent errors.
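
In the notation recoverable from the paper's figure-adjacent text, this premise is a claim about a single binary judgment. For the post-action check of subtask $i$:

```latex
v_i^{\mathrm{post}} = V(O_{t+1}, R_{t+1}, \mathrm{Post}_i, K_i), \qquad v_i^{j} \in \{0, 1\},
```

where $O_{t+1}$ and $R_{t+1}$ are the post-action observation and robot state, $\mathrm{Post}_i$ the completion condition, and $K_i$ the retrieved knowledge; each judgment is paired with a natural-language explanation $r_i^{j}$. When $v_i^{\mathrm{post}} = 1$ the subtask is marked complete; on failure the Guiding Decision Agent triggers re-execution, sequence adjustment, or human intervention. The premise is that $V$ returns the right bit often enough for this loop to make progress rather than oscillate or silently pass bad states.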

What would settle it

Repeated incorrect readiness or completion judgments by the Verification Agent during actual wet-lab runs of the tested workflows would show the closed-loop claim does not hold.

Figures

Figures reproduced from arXiv: 2605.07306 by Hongmei Fei, Huanbo Jin, Jiaming Gu, Qi Wang, Quan Lu, Ting Xiao, Xiwen Cao, Zhaohui Du, Zhe Liu, Zhe Wang.

Figure 1
Figure 1. Overall framework of BioProVLA-Agent. The Guiding Decision Agent coordinates protocol parsing, state verification, and embodied execution through the Tailored LLM Protocol Agent, VLM-RAG Verification Agent, and VLA Embodied Agent. view at source ↗
Figure 2
Figure 2. Structured parsing workflow of biological experimental protocols in the Tailored LLM Protocol Agent. view at source ↗
Figure 3
Figure 3. Closed-loop pre- and post-action state verification workflow of the VLM-RAG Verification Agent. view at source ↗
Figure 4
Figure 4. Curriculum-based online data augmentation framework in the VLA Embodied Agent. view at source ↗
Figure 5
Figure 5. Dataset assets. view at source ↗
Figure 6
Figure 6. Examples of single-task execution processes. view at source ↗
Figure 7
Figure 7. Examples of Dual-Arm task execution processes. view at source ↗
read the original abstract

Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces BioProVLA-Agent, an affordable protocol-driven embodied multi-agent system for biological lab manipulation that combines a Tailored LLM Protocol Agent, a VLM-RAG Verification Agent for state assessment, and a VLA Embodied Agent using AugSmolVLA (an online augmentation strategy for visual robustness). It claims improved execution stability over ACT, X-VLA, and SmolVLA on a hierarchical benchmark of 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, across normal and high-exposure settings, particularly for precise placement, transparent objects, and visually degraded scenes.

Significance. If the empirical claims hold with supporting metrics, the work could offer a practical, low-cost route to closed-loop embodied AI for wet-lab automation that leverages unstructured protocols and handles common visual challenges without specialized hardware, potentially aiding reproducibility in biological procedures.

major comments (2)
  1. Abstract and evaluation description: the claims of improved execution stability for AugSmolVLA over baselines (especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes) are presented without any quantitative metrics, error bars, statistical tests, or tabulated results. This absence prevents assessment of effect sizes and reproducibility of the reported gains.
  2. Abstract and system description: the closed-loop workflow's performance is attributed in part to the VLM-RAG Verification Agent's ability to assess readiness and completion using observations, robot states, retrieved knowledge, and examples, yet no accuracy, error rates, or failure-case analysis is provided for this agent under the targeted perturbations (transparent labware, reflections, overexposure). This is load-bearing for crediting stability improvements to the verification component.
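
To make the second comment concrete: the requested failure-case analysis amounts to auditing the Verification Agent's binary judgments against human labels, stratified by perturbation. A minimal sketch, assuming hypothetical log records; field names are assumptions, not the paper's.

```python
# Audit of verification judgments against ground truth, per perturbation type.
# 'logs' is assumed to hold dicts with keys 'perturbation' (e.g. 'transparent',
# 'reflection', 'overexposure', 'none'), 'predicted' (0/1), and 'actual' (0/1).
from collections import defaultdict

def audit(logs):
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for rec in logs:
        c = counts[rec["perturbation"]]
        if rec["predicted"] and rec["actual"]:
            c["tp"] += 1
        elif rec["predicted"] and not rec["actual"]:
            c["fp"] += 1  # dangerous case: loop advances on a bad wet-lab state
        elif rec["actual"]:
            c["fn"] += 1  # loop retries or stalls although the state was fine
        else:
            c["tn"] += 1
    for pert, c in sorted(counts.items()):
        n = sum(c.values())
        acc = (c["tp"] + c["tn"]) / n
        fpr = c["fp"] / (c["fp"] + c["tn"]) if (c["fp"] + c["tn"]) else 0.0
        print(f"{pert}: n={n} accuracy={acc:.2f} false-positive rate={fpr:.2f}")
```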

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the identification of areas where additional quantitative support would strengthen the presentation of our results and the role of individual components. We address each major comment below and commit to revisions that incorporate the requested metrics and analyses.

read point-by-point responses
  1. Referee: Abstract and evaluation description: the claims of improved execution stability for AugSmolVLA over baselines (especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes) are presented without any quantitative metrics, error bars, statistical tests, or tabulated results. This absence prevents assessment of effect sizes and reproducibility of the reported gains.

    Authors: We agree that the abstract and evaluation description would be strengthened by the inclusion of quantitative metrics. We will revise the abstract to report specific success rates and improvement percentages across the evaluated settings. In the evaluation section, we will add tabulated results that include success rates with error bars, statistical significance tests (e.g., paired t-tests or Wilcoxon tests), and effect size measures to enable assessment of reproducibility and the magnitude of gains over ACT, X-VLA, and SmolVLA. revision: yes

  2. Referee: Abstract and system description: the closed-loop workflow's performance is attributed in part to the VLM-RAG Verification Agent's ability to assess readiness and completion using observations, robot states, retrieved knowledge, and examples, yet no accuracy, error rates, or failure-case analysis is provided for this agent under the targeted perturbations (transparent labware, reflections, overexposure). This is load-bearing for crediting stability improvements to the verification component.

    Authors: We concur that the VLM-RAG Verification Agent's contribution requires explicit quantitative backing to substantiate its impact on closed-loop stability. We will add a dedicated analysis in the revised manuscript, including accuracy rates, error rates, and a breakdown of failure cases for the Verification Agent when operating under the specified perturbations (transparent labware, reflections, and overexposure). This evaluation will draw on the same benchmark tasks and will be presented alongside the overall system results to clarify the agent's role. revision: yes
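
For the statistical commitment in response 1, a minimal sketch of the paired test named there, over per-task success rates; the arrays below are illustrative placeholders, not numbers from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-task success rates for the same benchmark tasks under two
# policies (one pair per task); values are illustrative only.
augsmolvla = np.array([0.92, 0.85, 0.80, 0.74, 0.70, 0.88])
smolvla    = np.array([0.81, 0.72, 0.74, 0.61, 0.55, 0.79])

# Paired Wilcoxon signed-rank test across tasks; report the median paired
# difference alongside p as a simple effect-size summary.
stat, p = wilcoxon(augsmolvla, smolvla)
print(f"W={stat:.1f}, p={p:.3f}, median gain={np.median(augsmolvla - smolvla):.2f}")
```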

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an engineering system description and empirical evaluation of BioProVLA-Agent and AugSmolVLA on a hierarchical benchmark, with claims of improved stability over baselines like ACT and SmolVLA. No equations, first-principles derivations, or predictions are provided that reduce by construction to inputs, fitted parameters, or self-citations. The framework integrates protocol parsing, verification, and execution without self-definitional loops or load-bearing self-citations that collapse the central claims. The evaluation focuses on task performance metrics rather than any circular renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed. The approach assumes effective integration of existing VLA models with new components for wet-lab robustness.

pith-pipeline@v0.9.0 · 5655 in / 1345 out tokens · 42226 ms · 2026-05-11T01:25:34.847805+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 8 internal anchors

  1. N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng, and G. Ceder. An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624(7990):86–91, 2023. doi: 10.1038/s41586-023-06734-w

  2. B. Burger, P. M. Maffettone, V. V. Gusev, C. M. Aitchison, Y. Bai, X. Wang, X. Li, B. M. Alston, B. Li, R. Clowes, N. Rankin, B. Harris, R. S. Sprick, and A. I. Cooper. A mobile robotic chemist. Nature, 583(7815):237–241, 2020. doi: 10.1038/s41586-020-2442-2

  3. S. Jiang, D. Evans-Yamamoto, D. Bersenev, S. K. Palaniappan, and A. Yachie-Kinoshita. ProtoCode: Leveraging large language models (LLMs) for automated generation of machine-readable PCR protocols from scientific publications. SLAS Technol, 29(3):100134, 2024. doi: 10.1016/j.slast.2024.100134

  4. B. Ma, M. Ma, R. Li, J. Zheng, and D. Li. TOSQ: Transparent object segmentation via query-based dictionary lookup with transformers. Sensors (Basel), 25(15). doi: 10.3390/s25154700

  5. Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

  6. Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

  7. Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, and Jia Zeng. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274

  8. Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, and Andres Marafioti. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  9. T. Dai, S. Vijayakrishnan, F. T. Szczypiński, J. F. Ayme, E. Simaei, T. Fellowes, R. Clowes, L. Kotopanov, C. E. Shields, Z. Zhou, J. W. Ward, and A. I. Cooper. Autonomous mobile robots for exploratory synthetic chemistry. Nature, 635(8040):890–897, 2024. doi: 10.1038/s41586-024-08173-7

  10. G. Tom, S. P. Schmid, S. G. Baird, Y. Cao, K. Darvish, H. Hao, S. Lo, S. Pablo-García, E. M. Rajaonson, M. Skreta, N. Yoshikawa, S. Corapi, G. D. Akkoc, F. Strieth-Kalthoff, M. Seifrid, and A. Aspuru-Guzik. Self-driving laboratories for chemistry and materials science. Chem Rev, 124(16):9633–9732, 2024. doi: 10.1021/acs.chemrev.4c00055

  11. A. Stephenson, L. Lastra, B. Nguyen, Y. J. Chen, J. Nivala, L. Ceze, and K. Strauss. Physical laboratory automation in synthetic biology. ACS Synth Biol, 12(11):3156–3169. doi: 10.1021/acssynbio.3c00345

  12. A. M. Anhel, L. Alejaldre, and Á. Goñi-Moreno. The laboratory automation protocol (LAP) format and repository: A platform for enhancing workflow efficiency in synthetic biology. ACS Synth Biol, 12(12):3514–3520, 2023. doi: 10.1021/acssynbio.3c00397

  13. N. Singh, S. Lane, T. Yu, J. Lu, A. Ramos, H. Cui, and H. Zhao. A generalized platform for artificial intelligence-powered autonomous enzyme engineering. Nat Commun, 16(1):5648, 2025. doi: 10.1038/s41467-025-61209-y

  14. D. Pivin, A. Champie, M. Plante, F. Ferland, F. Michaud, and S. Rodrigue. OSCAR: A modular open-source robotic platform for biological laboratories. ACS Synth Biol, 15(3):1062–1072, 2026. doi: 10.1021/acssynbio.5c00733

  15. P. Salazar-Villacis and B. Benyahia. The ADEPT framework for assessing autonomous laboratory robotics. Commun Chem, 9(1), 2026. doi: 10.1038/s42004-026-01932-9

  16. Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. ChatGPT for robotics: Design principles and model abilities, 2023. URL https://arxiv.org/abs/2306.17582

  17. Takashi Inagaki, Akari Kato, Koichi Takahashi, Haruka Ozaki, and Genki N. Kanda. LLMs can generate robotic scripts from goal-oriented instructions in biological laboratory automation. arXiv preprint arXiv:2304.10267, 2023

  18. Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Ghareeb, and Samuel Rodriques. BioPlanner: Automatic evaluation of LLMs on protocol planning in biology. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2676–2694

  19. Seungjun Yi, Jaeyoung Lim, and Juyong Yoon. ProtoMed-LLM: An automatic evaluation framework for large language models in medical protocol formulation. arXiv preprint arXiv:2410.04601, 2024

  20. D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023. doi: 10.1038/s41586-023-06792-0

  21. A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Augmenting large language models with chemistry tools. Nat Mach Intell, 6(5):525–535. doi: 10.1038/s42256-024-00832-8

  22. Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, and Tianhe Yu. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  23. Zongzheng Zhang, Chenghao Yue, Haobo Xu, Minwen Liao, Xianglin Qi, Huan-ang Gao, Ziwei Wang, and Hao Zhao. RoboChemist: Long-horizon and safety-compliant robotic chemical experimentation. arXiv preprint arXiv:2509.08820, 2025

  24. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023

  25. Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  26. Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  27. Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, and Ayzaan Wahid. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR. ISBN 2640-3498

  28. Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  29. Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, and Brian Ichter. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  30. Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Jodilyn Peralta, and Brian Ichter. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023

  31. Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. GenAug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023

  32. Zoey Chen, Zhao Mandi, Homanga Bharadhwaj, Mohit Sharma, Shuran Song, Abhishek Gupta, and Vikash Kumar. Semantically controllable augmentations for generalizable robot learning. The International Journal of Robotics Research, 44(10-11):1705–1726, 2025

  33. Masato Kobayashi, Thanpimon Buamanee, and Yuki Uranishi. DABI: Evaluation of data augmentation methods using downsampling in bilateral control-based imitation learning with images. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16892–16898. IEEE. ISBN 9798331541392

  34. Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, and Nikhil J. Joshi. RoboVQA: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE. ISBN 9798350384574

  35. X. Wang, Y. Chen, and W. Zhu. A survey on curriculum learning. IEEE Trans Pattern Anal Mach Intell, 44(9):4555–4576, 2022. doi: 10.1109/tpami.2021.3069908