arxiv: 2606.08881 · v1 · pith:7OQXIGDOnew · submitted 2026-06-07 · 💻 cs.RO · cs.AI

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Pith reviewed 2026-06-27 18:01 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords vision-language-action models · robotic manipulation · low-cost robots · failure analysis · recovery metrics · benchmark · imitation learning · embodiment uncertainty

The pith

Pretrained VLA policies outperform imitation learning on low-cost SO-101 robots but performance stays highly task-dependent with execution instability as the main failure mode.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a standardized real-world benchmark on the affordable SO-101 robotic platform to test vision-language-action models and an imitation learning baseline across four manipulation tasks. It fine-tunes representative policies using teleoperated demonstrations and evaluates them with both standard success rates and new metrics that break failures into semantic versus execution categories while tracking recovery. Results indicate that stronger pretrained VLAs tend to beat the imitation baseline overall, yet outcomes differ sharply by task under low-cost hardware conditions. Execution instability accounts for most failures, and the ability to recover from errors varies across model architectures. The work matters because it moves evaluation away from simulations and expensive platforms toward practical low-cost deployment.

Core claim

On the SO-101 platform the benchmark shows that stronger pretrained VLA policies generally outperform the imitation learning baseline across four tasks, although success rates remain highly task-dependent under low-cost robotic conditions. Execution instability is the dominant failure source while recovery capability differs substantially across architectures. The evaluation uses a structured failure taxonomy, semantic- and execution-level decomposition, and recovery-aware metrics beyond binary success rates.

What carries the argument

The SO-101 benchmark with four manipulation tasks, unified evaluation protocols, structured failure taxonomy, and recovery-aware metrics that decompose failures and assess recovery.
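
As a rough illustration of what semantic/execution decomposition and recovery-aware metrics could look like in practice, the sketch below aggregates per-episode annotations into a success rate, failure shares by level, and a recovery rate. The `Episode` fields, the two level labels, and the definition of recovery rate (recovered episodes divided by episodes in which any error occurred) are assumptions made for this example; the summary above does not give the paper's exact taxonomy or formulas.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """Hypothetical per-episode annotation; field names are assumptions."""
    success: bool
    failure_level: Optional[str] = None  # "semantic", "execution", or None if no error occurred
    recovered: bool = False              # True if the policy recovered after an error


def summarize(episodes):
    """Return success rate, final-failure shares by level, and recovery rate."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    successes = sum(e.success for e in episodes)
    # Episodes in which some error was annotated, whether or not the task finished.
    errored = [e for e in episodes if e.failure_level is not None]
    # Final failures, grouped by the level of the annotated error.
    failed = Counter(e.failure_level for e in errored if not e.success)
    total_failed = sum(failed.values())
    shares = {k: v / total_failed for k, v in failed.items()} if total_failed else {}
    # Assumed definition: fraction of errored episodes in which the policy recovered.
    recovery = sum(e.recovered for e in errored) / len(errored) if errored else None
    return {
        "success_rate": successes / n,
        "failure_shares": shares,
        "recovery_rate": recovery,
    }
```

Two policies with the same `success_rate` can differ in `failure_shares` and `recovery_rate`, which is the kind of distinction a binary success metric cannot express. Failed episodes without an annotated level are ignored by the share computation in this sketch.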

If this is right

  • Pretrained VLA models deliver higher task success than imitation learning under low-cost deployment.
  • Task success rates depend strongly on the specific manipulation task chosen.
  • Execution instability accounts for the largest share of policy failures.
  • Recovery performance after errors varies substantially between different model architectures.
  • Failure and recovery analysis yields more information than binary success rates alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model development may need to target execution stability as a priority separate from pretraining strength.
  • Consistent performance on consumer robots could require additional task-specific adaptation beyond standard fine-tuning.
  • Extending the benchmark to additional low-cost platforms would test whether the observed patterns generalize.
  • Recovery mechanisms represent a distinct direction for improving real-world robustness.

Load-bearing premise

The four tasks together with unified evaluation protocols enable systematic comparison under embodiment uncertainty.

What would settle it

The core claim would be undermined if pretrained VLA policies failed to outperform the imitation learning baseline on the SO-101 platform across the four tasks, or if execution instability did not emerge as the dominant failure source under the structured taxonomy.
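
The two conditions can be written as a simple check over aggregated results. This is a sketch only: the input layout, the policy key `il_baseline`, the level label `execution`, and the reading of "generally outperform" as "every VLA policy's mean success rate exceeds the baseline's mean" are all assumptions for illustration, not the paper's criterion.

```python
def claim_holds(success, failures, baseline="il_baseline"):
    """Check both conditions of the core claim on aggregated results.

    success:  {policy_name: {task_name: success_rate}}  (hypothetical layout)
    failures: {failure_level: count}, e.g. {"semantic": 3, "execution": 9}
    """
    base = success[baseline]
    base_mean = sum(base.values()) / len(base)
    # Mean success rate across tasks for every non-baseline (VLA) policy.
    vla_means = {
        policy: sum(rates.values()) / len(rates)
        for policy, rates in success.items()
        if policy != baseline
    }
    # Assumed reading: each VLA policy beats the baseline on average.
    vla_better = all(mean > base_mean for mean in vla_means.values())
    # The dominant failure source is the level with the largest count.
    dominant = max(failures, key=failures.get)
    return vla_better and dominant == "execution"
```

A stricter variant could require a per-task win rather than a higher mean, which would interact with the task dependence the paper reports.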

Figures

Figures reproduced from arXiv: 2606.08881 by Xinchuan Qiu, Yi Yu.

Figure 1: Semantic-level and execution-level failure analysis.
Figure 2: Example trajectories for the Multi-object Packing task. Success cases are taken from the best-performing model, while failure

Original abstract

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $\pi_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a real-world benchmark on the low-cost SO-101 robotic arm for evaluating Vision-Language-Action (VLA) models and an imitation-learning baseline. It defines four manipulation tasks with unified protocols, fine-tunes and tests π0.5, SmolVLA, Wall-X, and ACT from teleoperated demonstrations, and augments standard success rates with a structured failure taxonomy, semantic/execution-level decomposition, and recovery-aware metrics. The central empirical finding is that stronger pretrained VLA policies generally outperform the baseline, yet performance is highly task-dependent, execution instability is the dominant failure mode, and recovery capability varies across architectures.

Significance. If the reported measurements hold, the work supplies a practical, accessible testbed for embodied AI under realistic hardware constraints and embodiment uncertainty. The emphasis on failure decomposition and recovery metrics, rather than binary success alone, is a constructive addition to the evaluation toolkit. The explicit qualification that results are task-dependent avoids over-generalization and aligns with the benchmark's stated scope.

minor comments (2)
  1. Abstract: the high-level claims would be strengthened by the inclusion of at least one quantitative result (e.g., mean success rate or dominant failure percentage) with an indication of variability across runs or tasks.
  2. The four-task composition is presented as enabling systematic comparison; a brief justification in §3 or §4 for why these particular tasks adequately sample the space of low-cost manipulation would help readers assess the scope of the benchmark.
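
One way to report the variability requested in minor comment 1 is a binomial confidence interval on each success rate. The sketch below computes a Wilson score interval from a count of successful trials. The choice of the Wilson interval and the function name are assumptions for illustration; the report does not say which statistic the authors would use.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half
```

With the small trial counts typical of real-robot evaluation, these intervals are wide, which is a reason to report them next to any headline success rate.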

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not enumerate any specific major comments, so we have no point-by-point responses to provide. We remain ready to incorporate any minor editorial or clarification changes requested by the editor.

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark

full rationale

The paper is a real-world empirical benchmark study that reports direct experimental measurements of policy performance, failure modes, and recovery on the SO-101 platform across four tasks. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on observed task success rates and failure taxonomy rather than any reduction to inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the four tasks and protocols are representative for low-cost robotic evaluation; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The four manipulation tasks are representative and the unified protocols enable systematic comparison under embodiment uncertainty.
    Invoked in the abstract when describing benchmark composition and purpose.


