pith. machine review for the scientific record.

arxiv: 2604.19728 · v1 · submitted 2026-04-21 · 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.SE

Recognition: unknown

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.SE
keywords vision-language-action · robot learning · open-source framework · vision-language model · large language model · tabletop manipulation · simulator evaluation · policy training

The pith

VLA Foundry supplies a single open codebase for training vision-language-action models from language pretraining through to action fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VLA Foundry as a framework that brings language model pretraining, vision-language pretraining, and vision-language-action fine-tuning into one shared training stack instead of requiring separate incompatible pipelines. This setup supports training entirely from scratch as well as swapping in existing pretrained backbones. The authors train and release two models, then test them in closed-loop on a tabletop manipulation simulator where the from-scratch version matches prior closed-source performance and the backbone-augmented version exceeds a baseline by a large margin. A sympathetic reader would care because the approach reduces the custom engineering needed to build capable robot policies and makes the full training chain publicly accessible. The work also adds usability fixes to the evaluation simulator and analysis tools so others can run similar experiments more easily.

Core claim

VLA Foundry is an open-source framework that provides a shared training stack with end-to-end control from language pretraining to action-expert fine-tuning of vision-language-action models. It supports both fully from-scratch training through an LLM to VLM to VLA pipeline and the use of pretrained backbones from Hugging Face. Two models trained with the framework are evaluated in closed-loop on the LBM Eval simulator: a fully open from-scratch model that reaches performance on par with prior closed-source work in the nominal setting, and a model built on the Qwen3-VL backbone that produces a strong multi-task tabletop manipulation policy outperforming the baseline by a wide margin.
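
As a rough mental model of what "one shared training stack" means in practice, the sketch below chains the three stages behind a single configuration object. The stage names, config fields, run_stage helper, and the Hugging Face identifier are illustrative assumptions for this review, not the actual VLA Foundry API.

```python
# Hypothetical sketch of a staged, single-codebase training pipeline.
# Stage names, config fields, and the run_stage helper are illustrative only;
# they are not the actual VLA Foundry API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StageConfig:
    name: str                        # "llm_pretrain", "vlm_pretrain", or "vla_finetune"
    datasets: List[str]              # data mixture for this stage
    init_from: Optional[str] = None  # checkpoint from an earlier stage, or a pretrained backbone id


@dataclass
class PipelineConfig:
    stages: List[StageConfig] = field(default_factory=list)


def run_stage(cfg: StageConfig) -> str:
    """Stand-in for one training stage; returns the checkpoint path it would produce."""
    source = cfg.init_from or "random init"
    print(f"[{cfg.name}] train on {cfg.datasets}, initialized from {source}")
    return f"checkpoints/{cfg.name}.pt"


# From-scratch path: LLM -> VLM -> VLA, all inside the same stack.
from_scratch = PipelineConfig(stages=[
    StageConfig("llm_pretrain", ["web_text"]),
    StageConfig("vlm_pretrain", ["image_text_pairs"], init_from="checkpoints/llm_pretrain.pt"),
    StageConfig("vla_finetune", ["robot_demos"], init_from="checkpoints/vlm_pretrain.pt"),
])

# Backbone path: reuse the same final stage, but initialize from a pretrained
# vision-language model (identifier shown is illustrative, not an exact model id).
backbone_path = PipelineConfig(stages=[
    StageConfig("vla_finetune", ["robot_demos"], init_from="Qwen/Qwen3-VL"),
])

if __name__ == "__main__":
    for pipeline in (from_scratch, backbone_path):
        for stage in pipeline.stages:
            run_stage(stage)
```

The only point of the sketch is that the from-scratch path and the backbone path differ in configuration, not in codebase, which is what the core claim turns on.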

What carries the argument

the shared training stack that handles end-to-end control from language pretraining to action-expert fine-tuning

If this is right

  • Fully open from-scratch models can reach performance levels comparable to previous closed-source work on tabletop manipulation tasks.
  • Substituting a strong pretrained vision-language backbone produces multi-task policies that substantially outperform a baseline on the same simulator.
  • Releasing the full codebase and model weights allows other researchers to train and evaluate their own VLA models without building separate pipelines for each stage.
  • Usability improvements to the simulator and analysis tools make it simpler for the community to reproduce and extend the evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other groups could adopt the same unified stack to test new backbones or tasks without rebuilding the entire training pipeline from scratch each time.
  • Running the released models on physical robot hardware would provide a direct check on whether the simulator results translate to real environments.
  • The framework's support for swapping components might enable faster iteration when newer vision-language models become available.

Load-bearing premise

Performance gains measured in the LBM Eval simulator reflect meaningful improvements in real-world robot capabilities, and the unified framework structure itself, rather than other implementation details, drives the reported results.

What would settle it

A side-by-side evaluation on the same LBM Eval tasks showing that the from-scratch model does not match prior closed-source success rates or that the Qwen3-VL model does not outperform the baseline by the reported margin.
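
Since such a check reduces to comparing per-task success counts for two policies under matched rollouts, one readily available way to test a single task's 2×2 outcome table is Barnard's exact test (which appears in the paper's reference list) as implemented in SciPy. The counts below are placeholders, not numbers reported by the paper; this is a minimal sketch, not the paper's evaluation protocol.

```python
# Sketch of a per-task success-rate comparison between two policies
# evaluated on the same simulator task. The counts are placeholders,
# not results reported by the paper.
from scipy.stats import barnard_exact

N_ROLLOUTS = 50      # rollouts per policy on one task (assumed)
successes_a = 38     # placeholder: policy A successes
successes_b = 27     # placeholder: policy B successes

# 2x2 contingency table: rows = policies, columns = (success, failure).
table = [
    [successes_a, N_ROLLOUTS - successes_a],
    [successes_b, N_ROLLOUTS - successes_b],
]

result = barnard_exact(table, alternative="two-sided")
print(f"policy A: {successes_a / N_ROLLOUTS:.2f}, "
      f"policy B: {successes_b / N_ROLLOUTS:.2f}, "
      f"Barnard exact p = {result.pvalue:.4f}")
```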

Original abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training within a single codebase supporting end-to-end control from language pretraining to action fine-tuning. It enables both fully from-scratch training via an LLM-to-VLM-to-VLA pipeline and integration of pretrained backbones such as Qwen3-VL. The authors train and release two models, evaluate closed-loop performance on the LBM Eval simulator for multi-task tabletop manipulation, contribute simulator usability improvements and STEP analysis tools, and claim that the from-scratch model matches prior closed-source work while the Qwen3-VL variant substantially outperforms the baseline. Code and weights are publicly released.

Significance. If the performance attribution holds, the work provides a reusable open engineering artifact that could reduce fragmentation in VLA development by replacing stitched pipelines with a shared stack. Explicit strengths include the public release of the full codebase, model weights, and simulator enhancements, which directly support reproducibility and community extension in robotics research.

major comments (2)
  1. [Abstract and Experiments] The positioning of the shared training stack and end-to-end control as the key innovation is load-bearing, yet no ablation studies are presented that hold data volume, compute budget, and backbone fixed while varying only the unification aspect of the pipeline. This makes it impossible to credit the reported gains (from-scratch parity and Qwen3-VL outperformance) specifically to VLA Foundry rather than implementation details or backbone strength. (A sketch of one such controlled grid follows these comments.)
  2. [Evaluation] All quantitative claims rest on closed-loop policy performance within the LBM Eval simulator alone; no real-robot transfer experiments or physical deployment results are included, so the practical significance for tabletop manipulation remains untested even if simulator numbers are accurate.
minor comments (2)
  1. The phrase 'nominal evaluation setting' appears in the abstract without a clear definition or pointer to the corresponding experimental protocol in the main text.
  2. Training hyperparameter tables or configuration files could be expanded to list exact data mixtures and optimizer settings used for the from-scratch versus backbone-substituted runs.
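
To make the control requested in major comment 1 concrete, the sketch below enumerates one possible ablation grid that varies only the unification factor while holding the data mixture, compute budget, and backbone fixed. The factor names and values are illustrative assumptions, not an experimental design taken from the paper.

```python
# Illustrative ablation grid: vary only whether the stages run in one unified
# stack or as stitched-together pipelines, holding everything else fixed.
# Factor names and values are hypothetical, not from the paper.
from itertools import product

pipelines = ["unified_stack", "stitched_pipelines"]   # the factor under test
backbones = ["from_scratch", "qwen3_vl"]              # held fixed within each comparison
fixed = {"data_mixture": "identical", "compute_budget": "identical"}

for backbone, pipeline in product(backbones, pipelines):
    condition = {"backbone": backbone, "pipeline": pipeline, **fixed}
    print(condition)

# Attributing a gain to unification requires a consistent gap between the two
# pipeline conditions within each backbone, with data and compute matched.
```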

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We are pleased that the open-source nature and potential for reducing fragmentation in VLA development are recognized as strengths. Below, we provide point-by-point responses to the major comments, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The positioning of the shared training stack and end-to-end control as the key innovation is load-bearing, yet no ablation studies are presented that hold data volume, compute budget, and backbone fixed while varying only the unification aspect of the pipeline. This makes it impossible to credit the reported gains (from-scratch parity and Qwen3-VL outperformance) specifically to VLA Foundry rather than implementation details or backbone strength.

    Authors: We agree that an ablation isolating the unification aspect—while controlling for data, compute, and backbone—would more definitively attribute performance improvements to the shared stack. Our experiments instead demonstrate the framework's end-to-end capabilities by training competitive models from scratch and with pretrained backbones, achieving parity with prior closed-source results and outperforming baselines in the Qwen3-VL case. These results highlight the practical utility of VLA Foundry for unified training. To address the concern, we will revise the abstract and experiments section to better scope the claims to the framework's enabling role and add a limitations paragraph discussing the lack of such fine-grained ablations, along with our intent to pursue them in follow-up work. revision: partial

  2. Referee: [Evaluation] All quantitative claims rest on closed-loop policy performance within the LBM Eval simulator alone; no real-robot transfer experiments or physical deployment results are included, so the practical significance for tabletop manipulation remains untested even if simulator numbers are accurate.

    Authors: We concur that real-robot validation is essential for establishing practical significance beyond simulation. The manuscript's evaluations are deliberately scoped to the LBM Eval simulator to enable reproducible, large-scale multi-task assessment in a controlled environment, consistent with many prior VLA works. We will update the evaluation and conclusion sections to explicitly acknowledge this limitation and to highlight that the released codebase and models are designed to facilitate future real-world deployment and transfer studies. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering framework with external-benchmark evaluations only

Full rationale

The paper introduces an open-source codebase for unified LLM/VLM/VLA training and reports empirical results on the external LBM Eval simulator. No mathematical derivations, fitted parameters renamed as predictions, self-definitional claims, or load-bearing self-citation chains appear in the provided text. Performance statements (on-par with prior closed work, Qwen3-VL variant outperforming baseline) are direct comparisons to external references rather than reductions to the paper's own inputs. The contribution is an artifact release plus simulator runs; the derivation chain is empty.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering and empirical release paper. No free parameters, mathematical axioms, or invented entities are introduced; the framework consists of software design choices and standard training practices.

pith-pipeline@v0.9.0 · 5601 in / 1014 out tokens · 59337 ms · 2026-05-10T02:05:56.453087+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 39 canonical work pages · 24 internal anchors

  1. [1]

    Alex Andonian et al. GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch. Version 2.0.0. Sept. 2023. doi: 10.5281/zenodo.5879544. url: https://www.github.com/eleutherai/gpt-neox

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla et al. “OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models”. In:arXiv preprint arXiv:2308.01390(2023)

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai et al. “Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond”. In:arXiv preprint arXiv:2308.12966(2023)

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai et al. “Qwen3-vl technical report”. In:arXiv preprint arXiv:2511.21631(2025)

  5. [5]

    Foundation Models for Robotics: Vision-Language-Action (VLA)

    Rohit Bandaru. “Foundation Models for Robotics: Vision-Language-Action (VLA)”. In: (Sept. 2025). url:https://rohitbandaru.github.io/blog/Foundation-Models-for-Robotics-VLA/

  6. [6]

    Significance Tests for 2×2 Tables

    G. A. Barnard. “Significance Tests for 2×2 Tables”. In:Biometrika34.1-2 (Jan. 1947), pp. 123–138. issn: 0006-3444.doi:10.1093/biomet/34.1-2.123. (Visited on 01/20/2025)

  7. [7]

    Lucas Beyer et al. PaliGemma: A versatile 3B VLM for transfer. July 2024. url: https://arxiv.org/abs/2407.07726v1 (visited on 09/05/2024)

  8. [8]

    PIQA: Reasoning about Physical Commonsense in Natural Language

    Yonatan Bisk et al. “PIQA: Reasoning about Physical Commonsense in Natural Language”. In:The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educa- tional Advances in Artificial Intelligence, EAAI 2020, New York, NY,...

  9. [9]

    WebDataset: A High-Performance Python-Based I/O System for Large Deep Learning Problems

    Thomas Breuel. WebDataset: A High-Performance Python-Based I/O System for Large Deep Learning Problems. 2020. url: https://github.com/webdataset/webdataset

  10. [10]

    LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    Remi Cadene et al. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch. https://github.com/huggingface/lerobot. 2024

  11. [11]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen et al. “Microsoft coco captions: Data collection and evaluation server”. In:arXiv preprint arXiv:1504.00325(2015)

  12. [12]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen et al. “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 24185–24198

  13. [13]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi et al. “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots”. In:Proceedings of Robotics: Science and Systems (RSS). 2024

  14. [14]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark et al. “BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp....

  15. [15]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark et al. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In:ArXiv preprintabs/1803.05457 (2018).url:https://arxiv.org/abs/1803.05457

  16. [16]

    Open X-Embodiment Collaboration et al.Open X-Embodiment: Robotic Learning Datasets and RT-X Models.https://arxiv.org/abs/2310.08864. 2023

  17. [17]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    StarVLA Community. “StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing”. In:arXiv preprint arXiv:2604.05014(2026)

  18. [18]

    Mustafa Shukor Dana Aubakirova, Jade Cholgari, and Leandro von Werra.VLAb: Your Laboratory for Pretraining VLAs.https://github.com/huggingface/vlab. 2025

  19. [19]

    Alexey Dosovitskiy et al.An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

  20. [20]

    arXiv:2010.11929 [cs.CV].url:https://arxiv.org/abs/2010.11929

  21. [21]

    Computing Extremely Accurate Quantiles Using t-Digests

    Ted Dunning and Otmar Ertl. “Computing Extremely Accurate Quantiles Using t-Digests”. In:arXiv preprint arXiv:1902.04023(2019).url:https://arxiv.org/abs/1902.04023

  22. [22]

    VLA-Scratch: A Modular, Performant, Efficient Stack For Vision-Language-Action Models

    EGalahad.VLA-Scratch: A Modular, Performant, Efficient Stack For Vision-Language-Action Models. https://github.com/EGalahad/vla-scratch. GitHub repository. 2025

  23. [23]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre et al. “Datacomp: In search of the next generation of multimodal datasets”. In: Advances in Neural Information Processing Systems36 (2023), pp. 27092–27112

  24. [24]

    open_lm: a minimal but performative language modeling (LM) repository

    Suchin Gururangan et al.open_lm: a minimal but performative language modeling (LM) repository. GitHub repository. 2023.url:https://github.com/mlfoundations/open_lm/

  25. [25]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks et al. “Measuring Massive Multitask Language Understanding”. In:9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenRe- view.net, 2021.url:https://openreview.net/forum?id=d7KBjmI3GmQ

  26. [26]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu et al. “Minicpm: Unveiling the potential of small language models with scalable training strategies”. In:arXiv preprint arXiv:2404.06395(2024)

  27. [27]

    Scaling Laws for Neural Language Models

    Jared Kaplan et al. “Scaling laws for neural language models”. In:arXiv preprint arXiv:2001.08361 (2020)

  28. [28]

    Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

    Siddharth Karamcheti et al. “Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models”. In:International Conference on Machine Learning (ICML). 2024

  29. [29]

    nanoGPT: The Simplest, Fastest Repository for Training/Finetuning Medium-Sized GPTs

    Andrej Karpathy.nanoGPT: The Simplest, Fastest Repository for Training/Finetuning Medium-Sized GPTs. 2022.url:https://github.com/karpathy/nanoGPT

  30. [30]

    Should VLMs be Pre-trained with Image Data?

    Sedrick Keh et al. “Should VLMs be Pre-trained with Image Data?” In:arXiv preprint arXiv:2503.07603 (2025)

  31. [31]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim et al. “Openvla: An open-source vision-language-action model”. In:arXiv preprint arXiv:2406.09246(2024)

  32. [32]

    Unfolding Robotics: The Open-Source Recipe for Teaching a Robot to Fold Your Clothes

    Pepijn Kooijmans et al.Unfolding Robotics: The Open-Source Recipe for Teaching a Robot to Fold Your Clothes. Accessed: 2026-04-17. 2026.url:https://huggingface.co/spaces/lerobot/robot-folding

  33. [33]

    Jason Lee et al.MolmoAct: Action Reasoning Models that can Reason in Space. 2025. arXiv:2508.07917 [cs.RO].url:https://arxiv.org/abs/2508.07917

  34. [34]

    Datacomp-lm: In search of the next generation of training sets for language models

    Jeffrey Li et al. “Datacomp-lm: In search of the next generation of training sets for language models”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 14200–14282

  35. [35]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li et al. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models”. In:Proceedings of the 40th International Conference on Machine Learning (ICML). 2023.url:https://arxiv.org/abs/2301.12597

  36. [36]

    A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

    Fanqi Lin et al. “A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation”. In:arXiv preprint arXiv:2602.01067(2026)

  37. [37]

    HoloBrain-0 Technical Report

    Xuewu Lin et al. “HoloBrain-0 Technical Report”. In:arXiv preprint arXiv:2602.12062(2026).url: https://arxiv.org/abs/2602.12062

  38. [38]

    Flow Matching for Generative Modeling

    Yaron Lipman et al. “Flow matching for generative modeling”. In:arXiv preprint arXiv:2210.02747 (2022)

  39. [39]

    Visual instruction tuning

    Haotian Liu et al. “Visual instruction tuning”. In:Advances in neural information processing systems36 (2023), pp. 34892–34916

  40. [40]

    LLM360 K2: Building a 65B 360-Open-Source Large Language Model from Scratch

    Zhengzhong Liu et al. “LLM360 K2: Building a 65B 360-Open-Source Large Language Model from Scratch”. In: arXiv preprint arXiv:2501.07124 (2025). doi: 10.48550/arXiv.2501.07124. url: https://arxiv.org/abs/2501.07124

  41. [41]

    LLM360: Towards Fully Transparent Open-Source LLMs

    Zhengzhong Liu et al. “LLM360: Towards Fully Transparent Open-Source LLMs”. In:arXiv preprint arXiv:2312.06550(2023).doi:10.48550/arXiv.2312.06550.url:https://arxiv.org/abs/2312.06550

  42. [42]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti et al. “Smolvlm: Redefining small and efficient multimodal models”. In:arXiv preprint arXiv:2504.05299(2025)

  43. [43]

    marin-community. Draccus: Configuration with Dataclasses+YAML+Argparse. https://github.com/marin-community/draccus. 2026

  44. [44]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov et al. “Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering”. In:Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 2381–2391.doi: 10.18653/v1/D18-1260.url:https://aclanthology.org/D18-1260

  45. [45]

    Ray: A Distributed Framework for Emerging AI Applications

    Philipp Moritz et al. “Ray: A Distributed Framework for Emerging AI Applications”. In:Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2018, pp. 561–577.url:https://www.usenix.org/conference/osdi18/presentation/moritz

  46. [46]

    Haruki Nishimura and Masha Itkina. Statistical Thinking for Robot Policy Evaluation: From Rigorous A/B Testing to Effective Visualization. Medium. Accessed: 2026-04-17. 2026. url: https://medium.com/toyotaresearch/statistical-thinking-for-robot-policy-evaluation-from-rigorous-a-b-testing-to-effective-0ae886fbd68d

  47. [47]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA et al. “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots”. In:ArXiv Preprint. Mar. 2025. arXiv:2503.14734

  48. [48]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab et al. “Dinov2: Learning robust visual features without supervision”. In:arXiv preprint arXiv:2304.07193(2023)

  49. [49]

    The fineweb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo et al. “The fineweb datasets: Decanting the web for the finest text data at scale”. In: Advances in Neural Information Processing Systems37 (2024), pp. 30811–30849

  50. [50]

    Drake Blender Tools: Importing Drake Simulations into Blender

    Nicholas Pfaff and Peter Werner.Drake Blender Tools: Importing Drake Simulations into Blender. https://github.com/nepfaff/drake-blender-tools. 2025

  51. [51]

    openpi: Open-Source Models and Packages for Robotics

    Physical Intelligence. openpi: Open-Source Models and Packages for Robotics. https://github.com/Physical-Intelligence/openpi. GitHub repository. Apache-2.0 License. 2025

  52. [52]

    Physical Intelligence et al.π0.5: a Vision-Language-Action Model with Open-World Generalization. 2025. arXiv:2504.16054 [cs.RO].url:https://arxiv.org/abs/2504.16054

  53. [53]

    Physical Intelligence et al. π∗0.6: a VLA That Learns From Experience. 2025. arXiv:2511.14759 [cs.LG]. url: https://arxiv.org/abs/2511.14759

  54. [54]

    An algorithm for a letter-based representation of all-pairwise comparisons

    Hans-Peter Piepho. “An algorithm for a letter-based representation of all-pairwise comparisons”. In: Journal of Computational and Graphical Statistics13.2 (2004), pp. 456–466

  55. [55]

    Learning transferable visual models from natural language supervision

    Alec Radford et al. “Learning transferable visual models from natural language supervision”. In:Inter- national conference on machine learning. PmLR. 2021, pp. 8748–8763

  56. [56]

    Build A Large Language Model (From Scratch)

    Sebastian Raschka. Build A Large Language Model (From Scratch). Manning, 2024. isbn: 978-1633437166. url: https://www.manning.com/books/build-a-large-language-model-from-scratch

  57. [57]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley et al. “DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters”. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2020, pp. 3505–3506. doi: 10.1145/3394486.3406703. url: https://github.com/deepspeedai/DeepSpeed

  58. [58]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Keisuke Sakaguchi et al. “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”. In:The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educa- tional Advances in Artificial Intelligence, EAAI 2020, New York,...

  59. [59]

    Fast-LLM: Accelerating Your LLM Training to Full Speed

    ServiceNow Research. Fast-LLM: Accelerating Your LLM Training to Full Speed. 2024. url: https://github.com/ServiceNow/Fast-LLM

  60. [60]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi et al. “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network”. In:Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 1874–1883

  61. [61]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. In:arXiv preprint arXiv:1909.08053(2019)

  62. [62]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor et al. “SmolVLA: A vision-language-action model for affordable and efficient robotics”. In:arXiv preprint(2025). arXiv:2506.01844 [cs.RO]

  63. [63]

    DINOv3

    Oriane Siméoni et al. “Dinov3”. In:arXiv preprint arXiv:2508.10104(2025)

  64. [64]

    Is Your Imitation Learning Policy Better Than Mine? Policy Comparison with Near-Optimal Stopping

    David Snyder et al. “Is Your Imitation Learning Policy Better Than Mine? Policy Comparison with Near-Optimal Stopping”. In:Proceedings of the Robotics: Science and Systems Conference (RSS) XXI. 2025

  65. [65]

    Big little lies: A compendium and simulation of p-hacking strategies

    Angelika M Stefan and Felix D Schönbrodt. “Big little lies: A compendium and simulation of p-hacking strategies”. In:Royal Society Open Science10.2 (2023)

  66. [66]

    A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

    TRI LBM Team et al. “A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation”. In: (2025). arXiv:2507.05331 [cs.RO].url:https://arxiv.org/abs/2507.05331

  67. [67]

    LBM Eval: A Simulation Benchmark for Large Behavior Model Policies

    TRI LBM Team et al. LBM Eval: A Simulation Benchmark for Large Behavior Model Policies. https://github.com/ToyotaResearchInstitute/lbm_eval. Toyota Research Institute. Version 1.1.0. 2025

  68. [68]

    Team OLMo et al. 2 OLMo 2 Furious. 2024. arXiv:2501.00656 [cs.CL]. url: https://arxiv.org/abs/2501.00656

  69. [69]

    Drake: Model-based design and verification for robotics

    Russ Tedrake and the Drake Development Team. Drake: Model-based design and verification for robotics. 2019. url: https://drake.mit.edu

  70. [70]

    VLA-Scratch: Modular, Performant and Efficient Stack

    Haoyang Weng et al. VLA-Scratch: Modular, Performant and Efficient Stack. https://github.com/EGalahad/vla-scratch. GitHub repository. 2026

  71. [71]

    Luis Wiedmann and Juyoung Suk.nanoVLM: The simplest repository to train your VLM in pure PyTorch.https://github.com/huggingface/nanoVLM. 2024

  72. [72]

    Dexbotic: Open-Source Vision-Language-Action Toolbox

    Bin Xie et al. “Dexbotic: Open-source vision-language-action toolbox”. In: arXiv preprint arXiv:2510.23511 (2025)

  73. [73]

    Seonghyeon Ye et al.World Action Models are Zero-shot Policies. 2026. arXiv:2602.15922 [cs.RO]. url:https://arxiv.org/abs/2602.15922

  74. [74]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers et al. “HellaSwag: Can a Machine Really Finish Your Sentence?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019, pp. 4791–4800. doi: 10.18653/v1/P19-1472. url: https://aclanthology.org/P19-1472

  75. [75]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai et al. “Sigmoid loss for language image pre-training”. In:Proceedings of the IEEE/CVF international conference on computer vision. 2023, pp. 11975–11986

  76. [76]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z. Zhao et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware”. In: Proceedings of Robotics: Science and Systems. Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.016

  77. [77]

    On the Continuity of Rotation Representations in Neural Networks

    Yi Zhou et al. “On the Continuity of Rotation Representations in Neural Networks”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2019, pp. 5745–5753
