Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Xinchuan Qiu; Yi Yu

arxiv: 2606.08881 · v1 · pith:7OQXIGDOnew · submitted 2026-06-07 · 💻 cs.RO · cs.AI


Pith reviewed 2026-06-27 18:01 UTC · model grok-4.3


keywords: vision-language-action models · robotic manipulation · low-cost robots · failure analysis · recovery metrics · benchmark · imitation learning · embodiment uncertainty


The pith

Pretrained VLA policies generally outperform imitation learning on the low-cost SO-101 robot, but performance stays highly task-dependent, with execution instability as the main failure mode.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a standardized real-world benchmark on the affordable SO-101 robotic platform to test vision-language-action models and an imitation learning baseline across four manipulation tasks. It fine-tunes representative policies using teleoperated demonstrations and evaluates them with both standard success rates and new metrics that break failures into semantic versus execution categories while tracking recovery. Results indicate that stronger pretrained VLAs tend to beat the imitation baseline overall, yet outcomes differ sharply by task under low-cost hardware conditions. Execution instability accounts for most failures, and the ability to recover from errors varies across model architectures. The work matters because it moves evaluation away from simulations and expensive platforms toward practical low-cost deployment.

Core claim

On the SO-101 platform the benchmark shows that stronger pretrained VLA policies generally outperform the imitation learning baseline across four tasks, although success rates remain highly task-dependent under low-cost robotic conditions. Execution instability is the dominant failure source while recovery capability differs substantially across architectures. The evaluation uses a structured failure taxonomy, semantic- and execution-level decomposition, and recovery-aware metrics beyond binary success rates.

What carries the argument

The SO-101 benchmark with four manipulation tasks, unified evaluation protocols, structured failure taxonomy, and recovery-aware metrics that decompose failures and assess recovery.
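The text available here does not give the paper's exact metric definitions, so the following is only a minimal sketch of how a failure decomposition and a recovery-aware metric could be computed from per-episode logs. The `Episode` record, its field names, the reading of "semantic" versus "execution" errors, and the rule that an episode counts as recovered when an error was observed but the task still succeeded are all assumptions made for illustration, not the authors' protocol.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """One evaluation rollout. All field names are hypothetical."""
    policy: str
    task: str
    success: bool
    # Level of the first observed error. Assumed reading: "semantic" means the
    # wrong object or goal, "execution" means the right intent but unstable
    # motion. None means no error was observed.
    error_level: Optional[str] = None


def summarize(episodes):
    """Return success rate, failure shares by level, and recovery rate."""
    n = len(episodes)
    failed = [e for e in episodes if not e.success]
    with_error = [e for e in episodes if e.error_level is not None]
    # Assumed rule: recovered = an error was observed, yet the task succeeded.
    recovered = [e for e in with_error if e.success]
    level_counts = Counter(e.error_level or "unlabeled" for e in failed)
    return {
        "success_rate": sum(e.success for e in episodes) / n if n else None,
        "failure_share": {k: v / len(failed) for k, v in level_counts.items()},
        "recovery_rate": len(recovered) / len(with_error) if with_error else None,
    }


# Made-up rollouts for illustration only; these are not results from the paper.
demo = [
    Episode("policy_a", "packing", True),
    Episode("policy_a", "packing", True, "execution"),   # slipped, then recovered
    Episode("policy_a", "packing", False, "execution"),
    Episode("policy_a", "packing", False, "semantic"),
    Episode("policy_a", "packing", False, "execution"),
]
print(summarize(demo))
```

Keeping the error level separate from the success flag is what lets a single log answer both questions: which kind of error dominates among failures, and how often a policy gets past an error once it has happened.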

If this is right

Pretrained VLA models deliver higher task success than imitation learning under low-cost deployment.
Task success rates depend strongly on the specific manipulation task chosen.
Execution instability accounts for the largest share of policy failures.
Recovery performance after errors varies substantially between different model architectures.
Failure and recovery analysis yields more information than binary success rates alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model development may need to target execution stability as a priority separate from pretraining strength.
Consistent performance on consumer robots could require additional task-specific adaptation beyond standard fine-tuning.
Extending the benchmark to additional low-cost platforms would test whether the observed patterns generalize.
Recovery mechanisms represent a distinct direction for improving real-world robustness.

Load-bearing premise

The four tasks together with unified evaluation protocols enable systematic comparison under embodiment uncertainty.

What would settle it

The core claim would be undermined if pretrained VLA policies failed to outperform the imitation learning baseline on the SO-101 platform across the four tasks, or if execution instability did not emerge as the dominant failure source in the structured taxonomy.
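The two conditions can be written as an explicit check, sketched below under stated assumptions. The function name, the input layout, the policy and task labels, and both decision rules (the best VLA's mean success rate across tasks must exceed the baseline's, and "execution" must have strictly more failures than any other category) are hypothetical choices; the paper may operationalize "generally outperform" and "dominant" differently.

```python
from statistics import mean


def check_core_claim(success, baseline, failure_counts):
    """Evaluate the two falsification conditions on summary statistics.

    success:        {policy: {task: success_rate}}   (hypothetical layout)
    baseline:       key of the imitation learning baseline in `success`
    failure_counts: {failure_category: count}
    """
    baseline_mean = mean(success[baseline].values())
    vla_means = {
        policy: mean(by_task.values())
        for policy, by_task in success.items()
        if policy != baseline
    }
    # Assumed rule 1: the strongest VLA beats the baseline on mean success.
    vla_ahead = max(vla_means.values()) > baseline_mean
    # Assumed rule 2: execution failures strictly outnumber every other category.
    execution = failure_counts.get("execution", 0)
    execution_dominant = all(
        execution > count
        for category, count in failure_counts.items()
        if category != "execution"
    )
    return {"vla_ahead": vla_ahead, "execution_dominant": execution_dominant}


# Placeholder numbers for illustration only; not taken from the paper.
placeholder_success = {
    "il_baseline": {"task_1": 0.5, "task_2": 0.3},
    "vla_a": {"task_1": 0.7, "task_2": 0.4},
}
placeholder_failures = {"execution": 12, "semantic": 5}
print(check_core_claim(placeholder_success, "il_baseline", placeholder_failures))
```

A comparison of means hides the task-dependence the paper stresses, so a per-task version of rule 1 would be a stricter reading of the same condition.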

Figures

Figures reproduced from arXiv: 2606.08881 by Xinchuan Qiu, Yi Yu.

**Figure 2.** Example trajectories for the Multi-object Packing task. Success cases are taken from the best-performing model, while failure ...

Original abstract

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $\pi_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper sets up a narrow benchmark on the cheap SO-101 arm. It finds that pretrained VLA models beat imitation learning, but execution instability dominates the failures and results stay very task-dependent.


The main takeaway is a benchmark paper on the low-cost SO-101 robot that evaluates a handful of VLA models and an imitation learning baseline across four manipulation tasks. Stronger pretrained policies come out ahead on average, but results vary sharply by task and execution failures are the biggest problem while recovery ability differs across models.

What is new is the combination of this affordable real platform with a structured failure taxonomy and recovery-aware metrics. Most VLA work stays in simulation or on expensive arms, so shifting to hardware most labs can actually buy and run is a reasonable step. They use teleoperated demonstrations for fine-tuning and apply unified protocols, which lets them compare π0.5, SmolVLA, Wall-X, and ACT directly.

The paper does a reasonable job by moving past binary success rates and flagging that low-cost deployment brings its own issues. The observation that performance is highly task-dependent under these conditions is consistent with what people usually see when leaving simulation.

The soft spots are mainly scope and evidence depth. Four tasks on one arm is a small set, and the abstract already notes the task-dependence, which caps how far the results can travel. Without visible numbers, error bars, or run counts in the provided text, it is hard to judge how solid the comparisons are. The failure taxonomy sounds useful but needs the full methods section to check if it is reproducible.

This is for researchers who want a practical reference point for testing VLA policies on budget hardware and who care about failure modes rather than broad theory. A reader needing large-scale or general claims will find it limited.

It deserves peer review because accessible real-world benchmarks are still scarce, even if this one stays modest in reach. The empirical framing looks straightforward with no obvious circularity.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a real-world benchmark on the low-cost SO-101 robotic arm for evaluating Vision-Language-Action (VLA) models and an imitation-learning baseline. It defines four manipulation tasks with unified protocols, fine-tunes and tests π0.5, SmolVLA, Wall-X, and ACT from teleoperated demonstrations, and augments standard success rates with a structured failure taxonomy, semantic/execution-level decomposition, and recovery-aware metrics. The central empirical finding is that stronger pretrained VLA policies generally outperform the baseline, yet performance is highly task-dependent, execution instability is the dominant failure mode, and recovery capability varies across architectures.

Significance. If the reported measurements hold, the work supplies a practical, accessible testbed for embodied AI under realistic hardware constraints and embodiment uncertainty. The emphasis on failure decomposition and recovery metrics, rather than binary success alone, is a constructive addition to the evaluation toolkit. The explicit qualification that results are task-dependent avoids over-generalization and aligns with the benchmark's stated scope.

minor comments (2)

Abstract: the high-level claims would be strengthened by the inclusion of at least one quantitative result (e.g., mean success rate or dominant failure percentage) with an indication of variability across runs or tasks.
The four-task composition is presented as enabling systematic comparison; a brief justification in §3 or §4 for why these particular tasks adequately sample the space of low-cost manipulation would help readers assess the scope of the benchmark.
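The first comment asks for a quantitative result with an indication of variability. For a success rate estimated from a small number of physical trials, one common option is a Wilson score interval, sketched below. The function name and the trial counts in the example are illustrative assumptions; the paper's actual run counts are not visible in the text reviewed here.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: 10 successes in 20 trials.
low, high = wilson_interval(10, 20)
print(f"success rate 0.50, 95% interval [{low:.2f}, {high:.2f}]")
```

The Wilson interval is chosen here over the simpler normal approximation because it stays well-behaved when the trial count is small or the success rate is near 0 or 1, both of which are plausible for real-robot evaluation.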

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not enumerate any specific major comments, so we have no point-by-point responses to provide. We remain ready to incorporate any minor editorial or clarification changes requested by the editor.

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark


The paper is a real-world empirical benchmark study that reports direct experimental measurements of policy performance, failure modes, and recovery on the SO-101 platform across four tasks. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on observed task success rates and failure taxonomy rather than any reduction to inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the four tasks and protocols are representative for low-cost robotic evaluation; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The four manipulation tasks are representative and the unified protocols enable systematic comparison under embodiment uncertainty. Invoked in the abstract when describing benchmark composition and purpose.

pith-pipeline@v0.9.1-grok · 5740 in / 1094 out tokens · 16684 ms · 2026-06-27T18:01:38.805517+00:00 · methodology


