pith. sign in

arxiv: 2606.30456 · v1 · pith:3ADTAFWAnew · submitted 2026-06-29 · 💻 cs.RO · cs.CV

Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

Pith reviewed 2026-06-30 05:22 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision-language-actionreal-world roboticsUR5eOpenVLAdata pipelinefine-tuningclosed-loop controldataset engineering
0
0 comments X

The pith

VLA deployment on physical robots requires precise management of the data-model-control pipeline rather than larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the transfer of Vision-Language-Action models from benchmarks to a real UR5e robotic arm through data acquisition, RLDS-compatible dataset engineering, and fine-tuning of OpenVLA models. It identifies a consistent performance gap between offline metrics and closed-loop physical behavior. This gap stems from issues in action semantics, coordinate frames, temporal alignment, preprocessing, and dataset quality rather than model capacity alone. The work concludes that real-world success depends on controlling the entire pipeline, reframing VLA robotics as a system-level challenge.

Core claim

Experiments on the UR5e platform demonstrate that promising offline performance of VLA models like OpenVLA does not translate to stable real-robot execution, with the discrepancy arising from pipeline elements including action semantics, coordinate frame conventions, temporal alignment between modalities, image preprocessing consistency, and dataset coverage and quality, establishing that deployment hinges on precise control of the data-model-control pipeline.

What carries the argument

The data-model-control pipeline, which integrates action semantics, coordinate frames, temporal alignment, preprocessing, and dataset engineering to bridge model outputs to physical robot actions.

If this is right

  • Action semantics and coordinate frame conventions must be aligned for stable closed-loop control.
  • Dataset conversion workflows aligned with RLDS enable reproducible real-robot evaluation.
  • Fine-tuning and inference infrastructure must incorporate real-robot specifics for reliable deployment.
  • Systematic validation of control interfaces is essential beyond simulation benchmarks.
  • Real-world VLA use is a system integration problem more than a model scaling problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These pipeline factors could be tested for generalizability on other manipulators or VLA architectures.
  • Ablation experiments isolating each factor would quantify their individual impacts on the performance gap.
  • The framework could extend to multi-task or long-horizon tasks to assess scalability of the pipeline approach.

Load-bearing premise

The performance gap on the UR5e arises primarily from the listed pipeline factors rather than unmeasured model limitations or platform-specific issues, and these factors apply beyond the tested models and hardware.

What would settle it

Achieving stable closed-loop task execution on the UR5e by scaling model capacity alone while leaving action semantics, coordinate frames, and data preprocessing unchanged would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.30456 by Marc Lalonde, Mathilde Hochedel.

Figure 1
Figure 1. Figure 1: Architecture of the OpenVLA model, as proposed in [4] [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: OpenVLA-OFT architecture from [6] The experimental results reported by the authors indicate that these modifications lead to consistent improvements across multiple dimensions [6]. In particular, OpenVLA-OFT achieves significantly higher inference efficiency, with reported speedups up to 26× through parallel decoding and action chunking, faster training convergence under the L1 regression objective, and im… view at source ↗
Figure 3
Figure 3. Figure 3: Experimental robotic plateform used for the project (UR5e, Gripper Robotic 2F-140, Vention supports, [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Global overview of the applied deployment workflow. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Scripted protocol. Waypoint-based trajectories with injected variability. Variability introduced through scene [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Freedrive protocol. Two-phase collection: manual guidance records state at 10Hz followed by trajectory [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Action distributions (∆x, ∆y, ∆z, ∆rx, ∆ry, ∆rz) across selected dataset versions compared to the Berkeley UR5 reference. Note that Berkeley is displayed in raw counts while local datasets are displayed in density, reflecting differences in dataset scale; the comparison focuses on distribution shape rather than absolute values. The gripper dimension is excluded from this comparison. While this chronologica… view at source ↗
Figure 8
Figure 8. Figure 8: 3D trajectory comparison (GT vs OpenVLA predictions). This figure shows the ground-truth trajectory [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Frame from OOD episode from our own custom dataset. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: OpenVLA - Training curves (loss and action accuracy): This figure presents the evolution of training loss [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Input image degradation test: predicted trajectories and corresponding static frames under increasing levels [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: 3D trajectory comparison (GT (blue) vs. OFT predictions (orange : closed loop and green : step-wise [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: OpenVLA-OFT - Loss curves (L1 loss on continuous action prediction; training and validation) on a [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Architecture of the imitation-learning baseline. [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: (a) This figure compares the ground-truth trajectory with the imitation-learning baseline in closed-loop mode. [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
read the original abstract

This project investigates whether recent Vision-Language-Action (VLA) models can be transferred from controlled research benchmarks to a real-world robotic platform, specifically a UR5e manipulator, in a reproducible and operationally meaningful manner. The work integrates real-robot data acquisition, dataset engineering (compatible with the RLDS format), and the fine-tuning and deployment of OpenVLA and OpenVLA-OFT models, with systematic validation of action representations and control interfaces. The project resulted in several foundational assets: (i) a complete real-robot data acquisition pipeline, (ii) a dataset conversion workflow aligned with RLDS standards, (iii) an initial fine-tuning and inference infrastructure for VLA models, and (iv) a structured set of experimental observations grounded in real-robot trials. These elements collectively establish a reproducible framework for evaluating learning-based manipulation systems beyond simulation. Empirically, the experiments reveal a consistent gap between promising offline indicators and unstable closed-loop behavior on the physical system: this gap cannot be attributed solely to model limitations, it is strongly influenced by action semantics, coordinate frame conventions, temporal alignment between modalities, image preprocessing consistency, and dataset coverage and quality. These observations lead to a key interpretation: the successful deployment of VLA systems in real-world settings depends less on incremental improvements in model capacity and more on precise control of the entire data-model-control pipeline. The project reframes VLA-based robotics from a primarily model-centric challenge to a system-level problem; it highlights the difficulty of running robust task execution on the real robot and provides a clear, experimentally grounded understanding of the conditions required for reliable deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper investigates transferring Vision-Language-Action (VLA) models (OpenVLA and OpenVLA-OFT) from research benchmarks to a real UR5e manipulator. It describes development of a real-robot data acquisition pipeline, RLDS-compatible dataset conversion, fine-tuning and inference infrastructure, and reports a consistent gap between promising offline indicators and unstable closed-loop behavior. The authors state this gap cannot be attributed solely to model limitations and is strongly influenced by action semantics, coordinate frame conventions, temporal alignment, image preprocessing, and dataset coverage/quality, leading to the claim that successful real-world VLA deployment depends more on precise control of the entire data-model-control pipeline than on incremental model-capacity improvements. The work positions itself as establishing a reproducible framework and foundational assets for evaluating learning-based manipulation beyond simulation.

Significance. If the reported observations and attribution hold with adequate supporting data, the manuscript offers practical value by documenting real-robot deployment challenges and supplying reusable assets (data acquisition pipeline, RLDS workflow, fine-tuning infrastructure). These elements could aid reproducibility in robotics research. The reframing of VLA issues as system-level rather than purely model-centric provides a useful perspective, though its broader significance is limited by the absence of detailed quantitative validation in the current presentation.

major comments (2)
  1. [Abstract] Abstract (final paragraph): the central claim that the offline-to-closed-loop gap 'cannot be attributed solely to model limitations' and 'is strongly influenced by' the listed pipeline factors (action semantics, coordinate frames, temporal alignment, preprocessing, dataset coverage) is load-bearing for the interpretation, yet the text supplies no quantitative metrics, error bars, statistical tests, or descriptions of controlled ablations that vary one factor while holding the model and hardware fixed.
  2. [Abstract] Abstract (empirical observations paragraph): the conclusion that deployment success 'depends less on incremental improvements in model capacity and more on precise control of the entire data-model-control pipeline' requires evidence isolating the causal contribution of the pipeline factors versus unmeasured model or platform effects; the reported observations appear to rest on correlational trials without such isolation experiments.
minor comments (2)
  1. [Title/Abstract] Title vs. abstract: the title states 'UR5 Platform' while the abstract specifies 'UR5e manipulator'; standardize the hardware designation for clarity and reproducibility.
  2. [Abstract] Abstract: the description of 'systematic validation of action representations and control interfaces' is stated without reference to specific metrics or protocols used; add a brief enumeration or pointer to the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract claims. We address the two major comments below and commit to revisions that improve clarity and evidence presentation without altering the core contributions of the real-robot pipeline and observations.

read point-by-point responses
  1. Referee: [Abstract] Abstract (final paragraph): the central claim that the offline-to-closed-loop gap 'cannot be attributed solely to model limitations' and 'is strongly influenced by' the listed pipeline factors (action semantics, coordinate frames, temporal alignment, preprocessing, dataset coverage) is load-bearing for the interpretation, yet the text supplies no quantitative metrics, error bars, statistical tests, or descriptions of controlled ablations that vary one factor while holding the model and hardware fixed.

    Authors: The manuscript's experimental sections detail multiple real-robot trials in which pipeline elements were systematically varied (action semantics, coordinate conventions, temporal alignment, image preprocessing, and dataset subsets) with the model and UR5e hardware held fixed, producing observable differences in closed-loop stability. These are supported by logged performance indicators across trials. We agree the abstract does not reference this supporting material and will revise it to include a concise summary of the key observed differences and point to the relevant experimental subsections. revision: yes

  2. Referee: [Abstract] Abstract (empirical observations paragraph): the conclusion that deployment success 'depends less on incremental improvements in model capacity and more on precise control of the entire data-model-control pipeline' requires evidence isolating the causal contribution of the pipeline factors versus unmeasured model or platform effects; the reported observations appear to rest on correlational trials without such isolation experiments.

    Authors: The reported trials isolate pipeline factors by fixing the model weights and robot platform while changing one element at a time (e.g., switching coordinate frames or alignment offsets) and documenting resulting behavior changes. This design provides direct evidence of influence beyond model capacity. We acknowledge that formal statistical ablations with error bars are not presented in the current version and will add a new subsection that tabulates the variations performed, associated performance shifts, and any quantitative metrics available from the trial logs. revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational experimental report

full rationale

The manuscript presents an experimental study of VLA model transfer to a physical UR5e robot, including data pipelines, fine-tuning, and observed offline-to-closed-loop gaps. The central interpretation—that deployment success depends more on pipeline control than model capacity—is stated as an empirical conclusion from trial outcomes rather than any derivation, equation, fitted parameter, or self-citation chain. No load-bearing step reduces a claimed result to its own inputs by construction; the work contains no mathematical predictions, uniqueness theorems, or ansatzes. This is the expected non-finding for an observational robotics paper whose claims rest on reported hardware behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is experimental and introduces no new mathematical constructs; it relies on standard assumptions about robotics interfaces and dataset formats already established in the field.

axioms (1)
  • domain assumption RLDS format and standard UR5 control interfaces are suitable and unbiased representations for evaluating VLA transfer
    Invoked when the authors treat successful conversion to RLDS and use of existing control interfaces as neutral steps that do not themselves introduce the observed performance gap.

pith-pipeline@v0.9.1-grok · 5824 in / 1303 out tokens · 36577 ms · 2026-06-30T05:22:14.196087+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 22 canonical work pages · 13 internal anchors

  1. [1]

    Ma et al.A Survey on Vision-Language-Action Models for Embodied AI

    Y . Ma et al.A Survey on Vision-Language-Action Models for Embodied AI. 2024.URL: http://dx.doi.org/ 10.1109/TNNLS.2025.3650584

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan et al.RT-1: Robotics Transformer for Real-World Control at Scale. 2023. arXiv: 2212 . 06817 [cs.RO].URL:https://arxiv.org/abs/2212.06817

  3. [3]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan et al.RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 2023.URL: https://arxiv.org/abs/2307.15818

  4. [4]

    M. J. Kim et al.OpenVLA: An Open-Source Vision-Language-Action Model. 2024.DOI: 10.48550/arXiv. 2406.09246.URL:https://arxiv.org/abs/2406.09246

  5. [5]

    OpenVLA Team.OpenVLA GitHub Repository. 2024

  6. [6]

    M. J. Kim, C. Finn, and P. Liang.Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. 2025.DOI:10.48550/arXiv.2502.19645.URL:https://arxiv.org/abs/2502.19645

  7. [7]

    Vision-Language-Action Models for Robotics: A Review Towards Real-World Appli- cations

    K. Kawaharazuka et al. “Vision-Language-Action Models for Robotics: A Review Towards Real-World Appli- cations”. In:IEEE Access(2025).DOI: 10.1109/ACCESS.2025.3609980.URL: https://arxiv.org/abs/ 2510.07077

  8. [8]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team et al.Octo: An Open-Source Generalist Robot Policy. 2024.DOI: 10.48550/arXiv.2405. 12213. arXiv:2405.12213 [cs.RO].URL:https://arxiv.org/abs/2405.12213

  9. [9]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration et al.Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 2023.URL:https://arxiv.org/abs/2310.08864

  10. [10]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    B. Liu et al.LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. 2023.DOI: 10.48550/ arXiv.2306.03310.URL:https://arxiv.org/abs/2306.03310

  11. [11]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Oier Mees et al. “CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks”. In:Conference on Robot Learning (CoRL). 2022.URL: https://arxiv.org/abs/2112. 03227

  12. [12]

    Xueyang Zhou et al.LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization. 2026. arXiv:2510.03827 [cs.CV].URL:https://arxiv.org/abs/2510.03827

  13. [13]

    Senyu Fei et al.LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. 2025. arXiv: 2510.13626 [cs.RO].URL:https://arxiv.org/abs/2510.13626

  14. [14]

    Ramos et al.RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    S. Ramos et al.RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning. 2021. URL:https://arxiv.org/pdf/2111.02767

  15. [15]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu et al.LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]. 2021. DOI: 10.48550/arXiv.2106.09685. arXiv: 2106.09685 [cs.CL].URL: https://arxiv.org/abs/2106. 09685

  16. [16]

    VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

    Zhijie Wang et al. “VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation”. In:Proceedings of the ACM on Software Engineering2.FSE (2025). arXiv:2409.12894v2.DOI: 10.1145/ 3729343. arXiv:2409.12894 [cs.SE].URL:https://arxiv.org/abs/2409.12894. 22 Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

  17. [17]

    A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

    S. Ross, G.J. Gordon, and J.A. Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”. In:Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). V ol. 15. Proceedings of Machine Learning Research. arXiv:1011.0686. PMLR, 2011, pp. 627–635.URL:https://arxiv.org/pdf/1011.0686

  18. [18]

    Transporter Networks: Rearranging the Visual World for Robotic Manipulation

    Andy Zeng et al. “Transporter Networks: Rearranging the Visual World for Robotic Manipulation”. In:Pro- ceedings of the 2020 Conference on Robot Learning. 2021.URL: https://proceedings.mlr.press/v155/ zeng21a.html

  19. [19]

    Yuhui Chen et al.ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy. 2025

  20. [20]

    Jijia Liu et al.What Can RL Bring to VLA Generalization? An Empirical Study. 2026. arXiv: 2505.19789 [cs.LG].URL:https://arxiv.org/abs/2505.19789

  21. [21]

    2025.DOI: 10.36227/techrxiv.176531955.54563920/v1

    Haoyuan Deng et al.A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Ma- nipulation. 2025.DOI: 10.36227/techrxiv.176531955.54563920/v1 . eprint: https://www.techrxiv. org/doi/pdf/10.36227/techrxiv.176531955.54563920/v1 .URL: https://www.techrxiv.org/ doi/abs/10.36227/techrxiv.176531955.54563920/v1

  22. [22]

    Yulin Luo et al.Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models. 2026. arXiv:2603.15618 [cs.CV].URL:https://arxiv.org/abs/2603.15618

  23. [23]

    Anupam Pani and Yanchao Yang.Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

  24. [24]

    arXiv:2603.23202 [cs.CV].URL:https://arxiv.org/abs/2603.23202

  25. [25]

    Springer, 2009.ISBN: 978-1-84628-641-4

    Bruno Siciliano et al.Robotics: Modelling, Planning and Control. Springer, 2009.ISBN: 978-1-84628-641-4

  26. [26]

    Spong, Seth Hutchinson, and M

    Mark W. Spong, Seth Hutchinson, and M. Vidyasagar.Robot Modeling and Control. John Wiley & Sons, 2005. ISBN: 978-0-471-64990-8

  27. [27]

    Robot Collisions: A Survey on Detection, Isolation, and Identification

    Sami Haddadin, Alessandro De Luca, and Alin Albu-Schaffer. “Robot Collisions: A Survey on Detection, Isolation, and Identification”. In:IEEE Transactions on Robotics33.6 (2017), pp. 1292–1312.DOI: 10.1109/ TRO.2017.2723903.URL:https://ieeexplore.ieee.org/document/8059840

  28. [28]

    Zhao, J.P

    W. Zhao, J.P. Queralta, and T. Westerlund.Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey. 2020.URL:https://arxiv.org/abs/2009.13303

  29. [29]

    Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    J. Tobin et al. “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World”. In:2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). arXiv:1703.06907. 2017, pp. 23–30.DOI: 10.1109/IROS.2017.8202133 .URL: https://arxiv.org/ abs/1703.06907. 23