LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Baochang Ren; Chenxi Li; Daqi Gao; Dongzhan Zhou; Huajun Chen; Jintao Xing; Lei Bai; Minting Pan; Ningyu Zhang; Rui Li

arXiv: 2606.13578 · v1 · pith:QLU3NAUBnew · submitted 2026-06-11 · cs.CL · cs.AI · cs.LG · cs.MM · cs.RO

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

Pith reviewed 2026-06-27 06:52 UTC · model grok-4.3

keywords LabVLA · RoboGenesis · Vision-Language-Action · LabUtopia · scientific laboratories · robot policies · flow matching · simulation data

The pith

LabVLA achieves the highest average success rate on the LabUtopia benchmark under both in-distribution and out-of-distribution settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scientific laboratories need AI that can execute physical protocols with robots, not just plan them. Existing vision-language-action models are rarely trained on lab instruments, transparent liquids, or fixed workflows, so the paper builds RoboGenesis to generate simulation demonstrations from atomic skills and presents LabVLA, trained with a two-stage recipe on a Qwen3-VL backbone. FAST pretraining first makes the model action-aware, then flow matching attaches a DiT action expert. On the LabUtopia benchmark, LabVLA reaches a higher average success rate than every evaluated baseline in both in-distribution and out-of-distribution settings. A reader would care because successful transfer to real laboratories would let AI move from hypothesis generation to actual bench execution.

Core claim

LabVLA is a vision-language-action model trained with FAST action token pretraining on the Qwen3-VL-4B-Instruct backbone followed by flow matching posttraining that attaches a DiT action expert under knowledge insulation; when supplied demonstrations generated by the RoboGenesis simulation workflow engine that composes laboratory protocols from atomic skills, it achieves the highest average success rate among all evaluated baselines on the LabUtopia benchmark under both in-distribution and out-of-distribution settings.
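
To make the first stage more concrete: FAST, as published, turns a chunk of continuous actions into discrete tokens through a discrete cosine transform, coefficient quantization, and byte-pair encoding. The sketch below covers only the transform-and-quantize steps on a toy chunk and omits the byte-pair encoding. The function names, the `scale` value, and the chunk itself are illustrative assumptions, not details taken from the LabVLA paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    t = np.arange(n).reshape(1, -1)   # time index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    mat[0, :] /= np.sqrt(2.0)         # rescale the constant row so rows are orthonormal
    return mat

def encode_chunk(chunk, scale=100.0):
    """Map a (timesteps, action_dims) chunk to integer DCT coefficients."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk
    return np.round(coeffs * scale).astype(np.int64)

def decode_chunk(tokens, scale=100.0):
    """Invert encode_chunk, up to the rounding error of the quantizer."""
    coeffs = tokens.astype(np.float64) / scale
    return dct_matrix(tokens.shape[0]).T @ coeffs

# Hypothetical smooth 8-step, 2-dimensional action chunk.
steps = np.linspace(0.0, 1.0, 8)
chunk = np.stack([np.sin(steps), 0.5 * steps], axis=1)
tokens = encode_chunk(chunk)
reconstructed = decode_chunk(tokens)
```

For smooth trajectories the integer coefficients are typically concentrated in the low-frequency rows, which is what makes a later compression step worthwhile.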

What carries the argument

The two-stage training recipe of FAST action token pretraining to render the vision-language backbone action-aware, followed by flow matching posttraining that attaches a DiT action expert, applied to data exported by RoboGenesis.
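
The second stage can be pictured with a minimal flow matching sketch. It assumes the common linear-interpolation path between Gaussian noise and the demonstrated action chunk, with the network regressing the constant velocity along that path. The text shown here does not give the paper's exact path, time sampling, or conditioning, so every name and shape below is an assumption for illustration.

```python
import numpy as np

def flow_matching_pair(actions, noise, t):
    """Return the interpolated point x_t and the velocity target for a batch."""
    t = t.reshape(-1, 1, 1)                  # broadcast one time value per sample
    x_t = (1.0 - t) * noise + t * actions    # straight line from noise (t=0) to actions (t=1)
    target_velocity = actions - noise        # derivative of x_t with respect to t
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
actions = rng.normal(size=(4, 16, 7))   # toy batch: 4 chunks, 16 steps, 7 action dims
noise = rng.normal(size=actions.shape)
t = rng.uniform(size=4)
x_t, target = flow_matching_pair(actions, noise, t)

# An oracle that already knows the true velocity carries the noise onto the
# actions (up to floating-point error); a trained action expert approximates it.
sampled = euler_sample(lambda x, tau: actions - noise, noise)
```

In a real policy the oracle would be replaced by the learned action expert, conditioned on features from the vision-language backbone.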

If this is right

Laboratory protocols can be composed from atomic skills and executed across supported robot profiles.
Vision-language-action policies can manage instruments and transparent liquids typical of scientific workflows.
Highest success rates hold in both in-distribution and out-of-distribution settings on LabUtopia.
Simulation-based data engines can supply structured demonstrations for specialized domains.
A unified learning framework accommodates diverse robot embodiments for experimental protocols.
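
The first implication, composing protocols from atomic skills and keeping only validated rollouts, can be sketched as follows. This is a toy illustration under stated assumptions: `AtomicSkill`, `compose`, `filter_rollouts`, and the beaker state fields are hypothetical names, not the RoboGenesis API, and no physics is simulated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]

@dataclass(frozen=True)
class AtomicSkill:
    name: str
    apply: Callable[[State], State]   # maps a state to the next state

def compose(skills: List[AtomicSkill]) -> Callable[[State], Tuple[State, List[str]]]:
    """Chain skills into one workflow that also records which skills ran."""
    def run(state: State) -> Tuple[State, List[str]]:
        trace = []
        for skill in skills:
            state = skill.apply(dict(state))   # copy so the input is not mutated
            trace.append(skill.name)
        return state, trace
    return run

def filter_rollouts(workflow, initial_states, is_success):
    """Run the workflow from each start state and keep only successful rollouts."""
    kept = []
    for start in initial_states:
        final, trace = workflow(start)
        if is_success(final):
            kept.append({"initial": start, "final": final, "skills": trace})
    return kept

# Hypothetical three-skill pouring protocol.
pick = AtomicSkill("pick_beaker", lambda s: {**s, "holding": "beaker"})
pour = AtomicSkill(
    "pour",
    lambda s: {**s, "target_ml": s["target_ml"] + s["source_ml"], "source_ml": 0}
    if s["holding"] == "beaker" else s,
)
place = AtomicSkill("place_beaker", lambda s: {**s, "holding": None})

workflow = compose([pick, pour, place])
starts = [
    {"holding": None, "source_ml": 50, "target_ml": 0},
    {"holding": None, "source_ml": 0, "target_ml": 0},   # empty beaker: should be filtered out
]
kept = filter_rollouts(workflow, starts, lambda s: s["target_ml"] >= 50 and s["holding"] is None)
```

Only the first start state survives the filter here, which mirrors the idea of exporting demonstrations only from rollouts that pass validation.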

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If real-world transfer holds, automated execution could reduce human operator time on routine bench protocols.
The atomic-skill composition method could scale to longer multi-step experiments beyond the current benchmark.
The same data-generation and two-stage recipe might apply to other precision domains such as pharmaceutical compounding or materials synthesis.
End-to-end pipelines could link literature reasoning models directly to physical execution once transfer is demonstrated.

Load-bearing premise

Simulation-generated demonstrations from RoboGenesis capture the dynamics of real laboratory instruments, transparent liquids, and fixed protocol workflows accurately enough for policies to transfer to physical execution.

What would settle it

Physical-robot execution of the same LabUtopia protocols in a real laboratory, with measured success rates compared directly to the reported simulation numbers.

Original abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LabVLA adds a lab-specific sim engine and two-stage training to VLA models and claims top benchmark scores, but the abstract gives no numbers or setup details to judge the size of the step.

Hi,

The main point here is a new simulation engine for generating lab workflow data plus a VLA model trained in two stages that reports the highest success rate on their benchmark. The abstract does not include any actual numbers, baseline list, or evaluation protocol, so the strength of that claim is hard to assess from what is shown.

What is new is RoboGenesis, which builds lab protocols from atomic skills in simulation and exports demonstrations for different robot profiles, plus the LabVLA recipe that first does FAST token pretraining on Qwen3-VL-4B to add action awareness and then attaches a flow-matching DiT expert. The target domain is also new: scientific instruments, transparent liquids, and fixed experimental protocols rather than household tasks.

The paper correctly flags that existing VLA work mostly skips these lab constraints, so the data and embodiment focus is reasonable. The two-stage approach is a straightforward way to adapt a vision-language backbone without starting from scratch.

The soft spot is the missing quantitative support. Without success rates, variance, or a description of the baselines, it is impossible to tell whether the ranking reflects a real advance or just a narrow comparison set. The sim-to-real transfer question is noted but not tested, which is acceptable for a benchmark paper yet caps how far the result travels.

This is aimed at people working on embodied models for experimental science. If the full paper supplies the missing numbers, ablations, and clear OOD definitions, it would be worth a referee's time.

I would send it for peer review to get the evaluation details checked.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RoboGenesis, a simulation-based workflow and data engine that composes laboratory protocols from atomic skills and exports structured demonstrations, and LabVLA, a VLA model obtained by FAST action-token pretraining on the Qwen3-VL-4B-Instruct backbone followed by flow-matching DiT attachment under knowledge insulation. It reports that LabVLA attains the highest average success rate on the newly constructed LabUtopia benchmark under both in-distribution and out-of-distribution conditions.

Significance. If the empirical ranking is reproducible and the benchmark tasks are representative, the work would be significant for robotics in scientific domains by supplying a scalable data-generation pipeline and a two-stage training recipe that first renders a VL backbone action-aware before continuous control. The explicit construction of RoboGenesis and LabUtopia as new artifacts is a concrete contribution that future lab-automation research can build upon.

major comments (2)

[Abstract] Abstract: the central claim that LabVLA records the highest average success rate is presented without any numerical values, baseline identities, number of trials, or statistical tests; this absence makes the magnitude and reliability of the reported improvement impossible to assess and is load-bearing for the empirical contribution.
[Evaluation] Evaluation section (inferred from benchmark description): the LabUtopia benchmark definition, task success criteria, and the precise in-distribution versus out-of-distribution splits are not specified, preventing verification that the ranking is not an artifact of benchmark construction or evaluation protocol.
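
The kind of reporting these two comments ask for can be sketched in a few lines: a per-task success rate with a Wilson score interval, plus an unweighted average across tasks. The task names and counts below are made up for illustration and are not results from the paper; whether the paper averages over tasks or over pooled trials is not stated in the text shown here.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def average_success_rate(per_task):
    """Unweighted mean of per-task success rates; per_task maps name -> (successes, trials)."""
    rates = [s / n for s, n in per_task.values()]
    return sum(rates) / len(rates)

# Made-up counts for illustration only; these are NOT results from the paper.
example = {"task_a": (45, 50), "task_b": (30, 50)}
mean_rate = average_success_rate(example)
low, high = wilson_interval(45, 50)
```

Reporting the interval alongside the mean makes it visible when two policies' rankings rest on differences smaller than the trial-to-trial noise.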

minor comments (1)

[Abstract] Abstract: the phrase 'knowledge insulation' is introduced without definition or reference to the mechanism that prevents interference between the pretrained backbone and the DiT expert.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater specificity in the abstract and evaluation protocol. We will revise the manuscript to address both points directly.

Referee: [Abstract] Abstract: the central claim that LabVLA records the highest average success rate is presented without any numerical values, baseline identities, number of trials, or statistical tests; this absence makes the magnitude and reliability of the reported improvement impossible to assess and is load-bearing for the empirical contribution.

Authors: We agree that the abstract should report concrete numbers to substantiate the central claim. In the revision we will insert the average success rates achieved by LabVLA and the primary baselines, state the number of evaluation trials per task, and note any statistical tests performed. This change will make the magnitude and reliability of the reported gains immediately verifiable.
revision: yes

Referee: [Evaluation] Evaluation section (inferred from benchmark description): the LabUtopia benchmark definition, task success criteria, and the precise in-distribution versus out-of-distribution splits are not specified, preventing verification that the ranking is not an artifact of benchmark construction or evaluation protocol.

Authors: We acknowledge that the current manuscript does not supply a sufficiently detailed description of LabUtopia. We will expand the evaluation section to define the benchmark tasks, specify the exact success criteria for each task, and enumerate the precise task splits used for the in-distribution and out-of-distribution conditions. These additions will enable independent verification of the evaluation protocol.
revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical contribution: construction of the RoboGenesis simulation engine for laboratory workflows, a two-stage training recipe (FAST token pretraining on Qwen3-VL-4B followed by flow-matching DiT) for LabVLA, and benchmark results on LabUtopia showing highest average success rates under in- and out-of-distribution conditions. No equations, fitted parameters renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The central claim reduces to measured performance on a newly constructed benchmark rather than any derivation that collapses to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Abstract-only view limits visibility into internal parameters; the central claim rests on the unverified transferability of simulation data and the effectiveness of the two-stage training recipe.

axioms (1)

domain assumption: Simulation rollouts from composed atomic skills produce demonstrations that are valid for real lab protocol execution.
Invoked when RoboGenesis is presented as the solution to the data bottleneck for lab-specific supervision.

invented entities (3)

RoboGenesis · no independent evidence
purpose: Simulation-based workflow and data engine that composes lab workflows and exports demonstrations
New component introduced to address the data bottleneck for laboratory VLA training.
LabVLA · no independent evidence
purpose: Vision-language-action policy specialized for scientific laboratory tasks
The proposed model trained with the two-stage recipe.
LabUtopia · no independent evidence
purpose: Benchmark for evaluating VLA models on laboratory tasks under in- and out-of-distribution conditions
New evaluation environment used to report the central performance claim.


[1] [1]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

Pith/arXiv arXiv

[2] [2]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

Pith/arXiv arXiv

[3] [3]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

Pith/arXiv arXiv

[4] [4]

Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817,

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817,

Pith/arXiv arXiv

[5] [5]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025a

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025a. Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and ...

Pith/arXiv arXiv

[6] [6]

Internvla-a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456,

Junhao Cai, Zetao Cai, Jiafei Cao, Yilun Chen, Zeyu He, Lei Jiang, Hang Li, Hengjie Li, Yang Li, Yufei Liu, et al. Internvla-a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456,

arXiv

[7] [7]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088,

Tianxing Chen, Zanxin Chen, Baĳun Chen, Zĳian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088,

Pith/arXiv arXiv

[8] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215,

[9] Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. arXiv preprint arXiv:2306.17817,

[10] Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations. arXiv preprint arXiv:2406.08545,

[11] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659,

[12] Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger kernel: Efficient triton kernels for llm training. arXiv preprint arXiv:2410.10989,

[13] Kefeng Huang, Jonathon Pipe, Alice E Martin, Tianyuan Wang, Barnabas A Franklin, Andy M Tyrrell, Ian JS Fairlamb, and Jihong Zhu. Tarmac: A taxonomy for robot manipulation in chemistry. arXiv preprint arXiv:2510.19289,

[14] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 𝜋0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054,

[15] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945,

[16] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246,

[17] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,

[18] Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272,

[19] Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Yan Peng, et al. Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9759–9769, 2025a.

[20] Runhao Li, Wenkai Guo, Zhenyu Wu, Changyuan Wang, Haoyuan Deng, Zhenyu Weng, Yap-Peng Tan, and Ziwei Wang. Map-vla: Memory-augmented prompting for vision-language-action model in robotic manipulation. arXiv preprint arXiv:2511.09516, 2025b.

Shoujie Li, Yan Huang, Changqing Guo, Tong Wu, Jiawei Zhang, Linrui Zhang, and Wenbo Ding. Chemistry3d: Robotic inter...

[21] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,

[22] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025.

[23] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

[24] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292,

[25] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523,

[26] Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. arXiv preprint arXiv:2603.04356,

[27] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747,

[28] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830,

[29] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236,

[30] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844,

[31] Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425,

[32] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085,

[33] Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651,

[34] Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455,

[35] Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520,

[36] Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922,

[37] Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav, Austin Wang, Mukul Khanna, Theophile Gervet, Tsung-Yen Yang, Vidhi Jain, Alexander William Clegg, John Turner, et al. Homerobot: Open-vocabulary mobile manipulation. arXiv preprint arXiv:2306.11565,

[38] Naruki Yoshikawa, Andrew Zou Li, Kourosh Darvish, Yuchi Zhao, Haoping Xu, Artur Kuramshin, Alán Aspuru-Guzik, Animesh Garg, and Florian Shkurti. Chemistry lab automation via constrained task and motion planning. arXiv preprint arXiv:2212.09672,

[39] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693,

[40] Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766,

[41] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025a.

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Fu...

[42] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Kevin Lin, Abhiram Maddukuri, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293,