pith. machine review for the scientific record.

arxiv: 2605.07306 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.AI

Recognition: no theorem link

BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:25 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied AI · VLA models · biological laboratory automation · multi-agent systems · protocol-driven robotics · vision-language-action · wet-lab manipulation · closed-loop control
0 comments

The pith

BioProVLA-Agent provides an affordable multi-agent robotic system that parses lab protocols, verifies states with vision and RAG, and executes actions via augmented VLA policies in closed loops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Biological labs require reproducible multi-step manipulation amid unstructured protocols, transparent or reflective labware, and variable lighting, yet most robotic solutions demand costly hardware or fixed setups. This paper introduces BioProVLA-Agent as a protocol-centered alternative that decomposes tasks through an LLM agent, checks readiness and completion with a VLM-RAG verification agent drawing on observations and examples, and performs actions with a lightweight VLA policy. An online augmentation method called AugSmolVLA specifically targets visual perturbations typical of wet labs. The system is tested on a benchmark of atomic tasks, composite workflows, and bimanual actions. If the approach holds, it lowers the barrier to automated biological work by letting researchers supply natural-language protocols rather than custom robot code.
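
Read as a control loop, the pith describes: parse once, then verify-execute-verify per subtask. A minimal sketch of that loop in Python follows, with every agent interface stubbed out as a hypothetical stand-in; the abstract does not specify the system's actual APIs.

```python
# Sketch of the closed-loop, protocol-driven execution described above.
# Every name here is a hypothetical stand-in, not the paper's API; the three
# agents are stubbed so that only the control flow is illustrated.
from dataclasses import dataclass

@dataclass
class Subtask:
    instruction: str    # natural-language execution instruction
    precondition: str   # state that must hold before acting
    postcondition: str  # state that signals completion
    knowledge: str      # retrieved protocol knowledge used for verification

def parse_protocol(protocol_text: str) -> list[Subtask]:
    """Stand-in for the Tailored LLM Protocol Agent: protocol -> subtasks."""
    raise NotImplementedError

def verify(observation, robot_state, condition: str, knowledge: str) -> bool:
    """Stand-in for the VLM-RAG Verification Agent's binary judgment."""
    raise NotImplementedError

def execute(instruction: str) -> None:
    """Stand-in for the VLA Embodied Agent (e.g., an AugSmolVLA policy)."""
    raise NotImplementedError

def run_protocol(protocol_text: str, observe, max_attempts: int = 3) -> None:
    for task in parse_protocol(protocol_text):
        for _ in range(max_attempts):
            obs, state = observe()
            if not verify(obs, state, task.precondition, task.knowledge):
                continue  # environment not ready; re-observe and retry
            execute(task.instruction)
            obs, state = observe()
            if verify(obs, state, task.postcondition, task.knowledge):
                break  # subtask verified complete; move to the next one
        else:
            # Persistent failure: the paper's Guiding Decision Agent would
            # re-execute, adjust the sequence, or ask for human intervention.
            raise RuntimeError(f"subtask failed verification: {task.instruction}")
```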

Core claim

BioProVLA-Agent integrates a Tailored LLM Protocol Agent that turns protocols into verifiable subtasks, a VLM-RAG Verification Agent that judges readiness and completion from robot states and retrieved knowledge, and a VLA Embodied Agent whose AugSmolVLA policy uses online augmentation to maintain stability in the presence of transparent objects, reflections, and illumination shifts. Evaluations across normal and high-exposure conditions show higher execution stability than ACT, X-VLA, and baseline SmolVLA on precise placement, composite sequences, and visually degraded scenes.

What carries the argument

The three-agent architecture (LLM Protocol Agent, VLM-RAG Verification Agent, VLA Embodied Agent) with AugSmolVLA online augmentation that targets transparent labware and lighting changes to support closed-loop protocol execution.
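
The augmentation half of this machinery has a concrete three-stage curriculum. So far as it can be recovered from the paper's figure-adjacent text, the observation fed to the policy at training stage $e$ is:

```latex
O_t^{(e)} =
\begin{cases}
O_t, & e \in \mathcal{E}_1, \\
T_{\alpha(e)}(O_t), & e \in \mathcal{E}_2, \\
\lambda\, O_t + (1 - \lambda)\, T_{\alpha(e)}(O_t), & e \in \mathcal{E}_3,
\end{cases}
```

where $T_{\alpha(e)}$ appears to be a stage-dependent visual perturbation, $\lambda$ a mixing weight, and $\mathcal{E}_1$, $\mathcal{E}_2$, $\mathcal{E}_3$ the original-data learning, augmented-data adaptation, and mixed-data consolidation stages, respectively.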

If this is right

  • Protocols become the direct interface for specifying tasks instead of low-level robot commands.
  • Execution stability increases for precise placement and handling of transparent objects in both normal and overexposed lighting.
  • The system supports hierarchical benchmarks covering atomic manipulations, composite workflows, and bimanual actions such as tube loading, sorting, and liquid pouring.
  • Closed-loop verification reduces reliance on one-shot instruction following in state-dependent procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same protocol-plus-verification pattern could apply to other protocol-heavy domains like chemical synthesis or sample preparation.
  • Adding real-time sensor fusion beyond vision might further reduce errors when lab conditions change mid-task.
  • The augmentation strategy for visual robustness could be tested on additional VLA backbones to measure transfer.

Load-bearing premise

The VLM-RAG Verification Agent can reliably judge task readiness and completion from observations, robot states, and examples in unstructured wet-lab conditions without frequent errors.
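
In the notation recoverable from the paper's figure-adjacent text, this premise is a claim about a single binary judgment. For the post-action check of subtask $i$:

```latex
v_i^{\mathrm{post}} = V(O_{t+1}, R_{t+1}, \mathrm{Post}_i, K_i), \qquad v_i^{j} \in \{0, 1\},
```

where $O_{t+1}$ and $R_{t+1}$ are the post-action observation and robot state, $\mathrm{Post}_i$ the completion condition, and $K_i$ the retrieved knowledge; each judgment is paired with a natural-language explanation $r_i^{j}$. When $v_i^{\mathrm{post}} = 1$ the subtask is marked complete; on failure the Guiding Decision Agent triggers re-execution, sequence adjustment, or human intervention. The premise is that $V$ returns the right bit often enough for this loop to make progress rather than oscillate or silently pass bad states.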

What would settle it

Repeated incorrect readiness or completion judgments by the Verification Agent during actual wet-lab runs of the tested workflows would show the closed-loop claim does not hold.

Figures

Figures reproduced from arXiv: 2605.07306 by Hongmei Fei, Huanbo Jin, Jiaming Gu, Qi Wang, Quan Lu, Ting Xiao, Xiwen Cao, Zhaohui Du, Zhe Liu, Zhe Wang.

Figure 1
Figure 1. Overall framework of BioProVLA-Agent. The Guiding Decision Agent coordinates protocol parsing, state verification, and embodied execution through the Tailored LLM Protocol Agent, VLM-RAG Verification Agent, and VLA Embodied Agent. view at source ↗
Figure 2
Figure 2. Structured parsing workflow of biological experimental protocols in the Tailored LLM Protocol Agent. view at source ↗
Figure 3
Figure 3. Closed-loop pre- and post-action state verification workflow of the VLM-RAG Verification Agent. view at source ↗
Figure 4
Figure 4. Curriculum-based online data augmentation framework in the VLA Embodied Agent. view at source ↗
Figure 5
Figure 5. Dataset assets. view at source ↗
Figure 6
Figure 6. Examples of single-task execution processes. view at source ↗
Figure 7
Figure 7. Examples of Dual-Arm task execution processes. view at source ↗
read the original abstract

Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces BioProVLA-Agent, an affordable protocol-driven embodied multi-agent system for biological lab manipulation that combines a Tailored LLM Protocol Agent, a VLM-RAG Verification Agent for state assessment, and a VLA Embodied Agent using AugSmolVLA (an online augmentation strategy for visual robustness). It claims improved execution stability over ACT, X-VLA, and SmolVLA on a hierarchical benchmark of 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, across normal and high-exposure settings, particularly for precise placement, transparent objects, and visually degraded scenes.

Significance. If the empirical claims hold with supporting metrics, the work could offer a practical, low-cost route to closed-loop embodied AI for wet-lab automation that leverages unstructured protocols and handles common visual challenges without specialized hardware, potentially aiding reproducibility in biological procedures.

major comments (2)
  1. Abstract and evaluation description: the claims of improved execution stability for AugSmolVLA over baselines (especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes) are presented without any quantitative metrics, error bars, statistical tests, or tabulated results. This absence prevents assessment of effect sizes and reproducibility of the reported gains.
  2. Abstract and system description: the closed-loop workflow's performance is attributed in part to the VLM-RAG Verification Agent's ability to assess readiness and completion using observations, robot states, retrieved knowledge, and examples, yet no accuracy, error rates, or failure-case analysis is provided for this agent under the targeted perturbations (transparent labware, reflections, overexposure). This is load-bearing for crediting stability improvements to the verification component.
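
To make the second comment concrete: the requested failure-case analysis amounts to auditing the Verification Agent's binary judgments against human labels, stratified by perturbation. A minimal sketch, assuming hypothetical log records; field names are assumptions, not the paper's.

```python
# Audit of verification judgments against ground truth, per perturbation type.
# 'logs' is assumed to hold dicts with keys 'perturbation' (e.g. 'transparent',
# 'reflection', 'overexposure', 'none'), 'predicted' (0/1), and 'actual' (0/1).
from collections import defaultdict

def audit(logs):
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for rec in logs:
        c = counts[rec["perturbation"]]
        if rec["predicted"] and rec["actual"]:
            c["tp"] += 1
        elif rec["predicted"] and not rec["actual"]:
            c["fp"] += 1  # dangerous case: loop advances on a bad wet-lab state
        elif rec["actual"]:
            c["fn"] += 1  # loop retries or stalls although the state was fine
        else:
            c["tn"] += 1
    for pert, c in sorted(counts.items()):
        n = sum(c.values())
        acc = (c["tp"] + c["tn"]) / n
        fpr = c["fp"] / (c["fp"] + c["tn"]) if (c["fp"] + c["tn"]) else 0.0
        print(f"{pert}: n={n} accuracy={acc:.2f} false-positive rate={fpr:.2f}")
```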

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the identification of areas where additional quantitative support would strengthen the presentation of our results and the role of individual components. We address each major comment below and commit to revisions that incorporate the requested metrics and analyses.

read point-by-point responses
  1. Referee: Abstract and evaluation description: the claims of improved execution stability for AugSmolVLA over baselines (especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes) are presented without any quantitative metrics, error bars, statistical tests, or tabulated results. This absence prevents assessment of effect sizes and reproducibility of the reported gains.

    Authors: We agree that the abstract and evaluation description would be strengthened by the inclusion of quantitative metrics. We will revise the abstract to report specific success rates and improvement percentages across the evaluated settings. In the evaluation section, we will add tabulated results that include success rates with error bars, statistical significance tests (e.g., paired t-tests or Wilcoxon tests), and effect size measures to enable assessment of reproducibility and the magnitude of gains over ACT, X-VLA, and SmolVLA. revision: yes

  2. Referee: Abstract and system description: the closed-loop workflow's performance is attributed in part to the VLM-RAG Verification Agent's ability to assess readiness and completion using observations, robot states, retrieved knowledge, and examples, yet no accuracy, error rates, or failure-case analysis is provided for this agent under the targeted perturbations (transparent labware, reflections, overexposure). This is load-bearing for crediting stability improvements to the verification component.

    Authors: We concur that the VLM-RAG Verification Agent's contribution requires explicit quantitative backing to substantiate its impact on closed-loop stability. We will add a dedicated analysis in the revised manuscript, including accuracy rates, error rates, and a breakdown of failure cases for the Verification Agent when operating under the specified perturbations (transparent labware, reflections, and overexposure). This evaluation will draw on the same benchmark tasks and will be presented alongside the overall system results to clarify the agent's role. revision: yes
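
For the statistical commitment in response 1, a minimal sketch of the paired test named there, over per-task success rates; the arrays below are illustrative placeholders, not numbers from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-task success rates for the same benchmark tasks under two
# policies (one pair per task); values are illustrative only.
augsmolvla = np.array([0.92, 0.85, 0.80, 0.74, 0.70, 0.88])
smolvla    = np.array([0.81, 0.72, 0.74, 0.61, 0.55, 0.79])

# Paired Wilcoxon signed-rank test across tasks; report the median paired
# difference alongside p as a simple effect-size summary.
stat, p = wilcoxon(augsmolvla, smolvla)
print(f"W={stat:.1f}, p={p:.3f}, median gain={np.median(augsmolvla - smolvla):.2f}")
```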

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an engineering system description and empirical evaluation of BioProVLA-Agent and AugSmolVLA on a hierarchical benchmark, with claims of improved stability over baselines like ACT and SmolVLA. No equations, first-principles derivations, or predictions are provided that reduce by construction to inputs, fitted parameters, or self-citations. The framework integrates protocol parsing, verification, and execution without self-definitional loops or load-bearing self-citations that collapse the central claims. The evaluation focuses on task performance metrics rather than any circular renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed. The approach assumes effective integration of existing VLA models with new components for wet-lab robustness.

pith-pipeline@v0.9.0 · 5655 in / 1345 out tokens · 42226 ms · 2026-05-11T01:25:34.847805+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 8 internal anchors

  1. N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng, and G. Ceder. An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624(7990):86–91, 2023. doi: 10.1038/s41586-023-06734-w

  2. B. Burger, P. M. Maffettone, V. V. Gusev, C. M. Aitchison, Y. Bai, X. Wang, X. Li, B. M. Alston, B. Li, R. Clowes, N. Rankin, B. Harris, R. S. Sprick, and A. I. Cooper. A mobile robotic chemist. Nature, 583(7815):237–241, 2020. doi: 10.1038/s41586-020-2442-2

  3. S. Jiang, D. Evans-Yamamoto, D. Bersenev, S. K. Palaniappan, and A. Yachie-Kinoshita. ProtoCode: Leveraging large language models (LLMs) for automated generation of machine-readable PCR protocols from scientific publications. SLAS Technol, 29(3):100134, 2024. doi: 10.1016/j.slast.2024.100134

  4. B. Ma, M. Ma, R. Li, J. Zheng, and D. Li. TOSQ: Transparent object segmentation via query-based dictionary lookup with transformers. Sensors (Basel), 25(15). doi: 10.3390/s25154700

  5. Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

  6. Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

  7. Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, and Jia Zeng. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274

  8. Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, and Andres Marafioti. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  9. T. Dai, S. Vijayakrishnan, F. T. Szczypiński, J. F. Ayme, E. Simaei, T. Fellowes, R. Clowes, L. Kotopanov, C. E. Shields, Z. Zhou, J. W. Ward, and A. I. Cooper. Autonomous mobile robots for exploratory synthetic chemistry. Nature, 635(8040):890–897, 2024. doi: 10.1038/s41586-024-08173-7

  10. G. Tom, S. P. Schmid, S. G. Baird, Y. Cao, K. Darvish, H. Hao, S. Lo, S. Pablo-García, E. M. Rajaonson, M. Skreta, N. Yoshikawa, S. Corapi, G. D. Akkoc, F. Strieth-Kalthoff, M. Seifrid, and A. Aspuru-Guzik. Self-driving laboratories for chemistry and materials science. Chem Rev, 124(16):9633–9732, 2024. doi: 10.1021/acs.chemrev.4c00055

  11. A. Stephenson, L. Lastra, B. Nguyen, Y. J. Chen, J. Nivala, L. Ceze, and K. Strauss. Physical laboratory automation in synthetic biology. ACS Synth Biol, 12(11):3156–3169. doi: 10.1021/acssynbio.3c00345

  12. A. M. Anhel, L. Alejaldre, and Á. Goñi-Moreno. The laboratory automation protocol (LAP) format and repository: A platform for enhancing workflow efficiency in synthetic biology. ACS Synth Biol, 12(12):3514–3520, 2023. doi: 10.1021/acssynbio.3c00397

  13. N. Singh, S. Lane, T. Yu, J. Lu, A. Ramos, H. Cui, and H. Zhao. A generalized platform for artificial intelligence-powered autonomous enzyme engineering. Nat Commun, 16(1):5648, 2025. doi: 10.1038/s41467-025-61209-y

  14. D. Pivin, A. Champie, M. Plante, F. Ferland, F. Michaud, and S. Rodrigue. OSCAR: A modular open-source robotic platform for biological laboratories. ACS Synth Biol, 15(3):1062–1072, 2026. doi: 10.1021/acssynbio.5c00733

  15. P. Salazar-Villacis and B. Benyahia. The ADEPT framework for assessing autonomous laboratory robotics. Commun Chem, 9(1), 2026. doi: 10.1038/s42004-026-01932-9

  16. Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. ChatGPT for robotics: Design principles and model abilities, 2023. URL https://arxiv.org/abs/2306.17582

  17. Takashi Inagaki, Akari Kato, Koichi Takahashi, Haruka Ozaki, and Genki N. Kanda. LLMs can generate robotic scripts from goal-oriented instructions in biological laboratory automation. arXiv preprint arXiv:2304.10267, 2023

  18. Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Ghareeb, and Samuel Rodriques. BioPlanner: Automatic evaluation of LLMs on protocol planning in biology. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2676–2694

  19. Seungjun Yi, Jaeyoung Lim, and Juyong Yoon. ProtoMed-LLM: An automatic evaluation framework for large language models in medical protocol formulation. arXiv preprint arXiv:2410.04601, 2024

  20. D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023. doi: 10.1038/s41586-023-06792-0

  21. A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Augmenting large language models with chemistry tools. Nat Mach Intell, 6(5):525–535. doi: 10.1038/s42256-024-00832-8

  22. Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, and Tianhe Yu. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  23. Zongzheng Zhang, Chenghao Yue, Haobo Xu, Minwen Liao, Xianglin Qi, Huan-ang Gao, Ziwei Wang, and Hao Zhao. RoboChemist: Long-horizon and safety-compliant robotic chemical experimentation. arXiv preprint arXiv:2509.08820, 2025

  24. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023

  25. Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  26. Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  27. Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, and Ayzaan Wahid. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR. ISBN 2640-3498

  28. Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  29. Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, and Brian Ichter. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  30. Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Jodilyn Peralta, and Brian Ichter. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023

  31. Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. GenAug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023

  32. Zoey Chen, Zhao Mandi, Homanga Bharadhwaj, Mohit Sharma, Shuran Song, Abhishek Gupta, and Vikash Kumar. Semantically controllable augmentations for generalizable robot learning. The International Journal of Robotics Research, 44(10-11):1705–1726, 2025

  33. Masato Kobayashi, Thanpimon Buamanee, and Yuki Uranishi. DABI: Evaluation of data augmentation methods using downsampling in bilateral control-based imitation learning with images. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16892–16898. IEEE. ISBN 9798331541392

  34. Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, and Nikhil J. Joshi. RoboVQA: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE. ISBN 9798350384574

  35. X. Wang, Y. Chen, and W. Zhu. A survey on curriculum learning. IEEE Trans Pattern Anal Mach Intell, 44(9):4555–4576, 2022. doi: 10.1109/tpami.2021.3069908