arxiv: 2605.29710 · v1 · pith:43ERLFA5new · submitted 2026-05-28 · 💻 cs.RO

PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology

Pith reviewed 2026-06-29 06:43 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action policies · real-robot benchmark · distributional evaluation · time-to-success CDF · Kolmogorov-Smirnov test · Human-Relative Throughput · Franka FR3 · VLA evaluation methodology

The pith

Time-to-success CDFs with KS testing resolve two of the three closest VLA comparisons at N≤30 rollouts per cell, where binary success rates cannot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current real-robot VLA evaluations rely on binary success at a fixed timeout with small cohorts of 25 or fewer rollouts, which often fail to distinguish similar policies. The paper introduces a benchmark called PhAIL that instead treats the full time-to-success cumulative distribution function as the primitive. It defines Human-Relative Throughput as a dimensionless score anchored to human teleoperation and applies a macro-averaged Kolmogorov-Smirnov test across objects. On four public VLAs the KS test separates GR00T from ACT and OpenPI from ACT at N≤30 per cell, while binary metrics do not; the closest pair remains unresolved. The best VLA is reported as roughly seven times slower than the human reference under this metric.
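
To make the primitive concrete, here is a minimal sketch of an empirical time-to-success CDF. It assumes each rollout is recorded as seconds to success, with `math.inf` standing in for a hard failure. That encoding, the function name `time_to_success_cdf`, and the sample numbers are illustrative assumptions, not taken from the paper or its reference implementation.

```python
import math

def time_to_success_cdf(times, grid):
    """Empirical CDF: fraction of rollouts that succeeded by each time in `grid`.

    `times` holds one entry per rollout (seconds to success, or math.inf for a
    hard failure). This encoding is an assumption made for the sketch.
    """
    n = len(times)
    # math.inf is never <= a finite grid point, so failures never count as done.
    return [sum(t <= g for t in times) / n for g in grid]

# Hypothetical rollouts: four successes and one hard failure.
rollouts = [12.0, 18.5, 30.0, 47.0, math.inf]
grid = [10.0, 20.0, 30.0, 60.0, 600.0]

cdf = time_to_success_cdf(rollouts, grid)
print(cdf)  # [0.0, 0.4, 0.6, 0.8, 0.8]
```

Reading the curve at a single timeout recovers the binary success rate (0.6 at 30 s in this toy data), while the plateau at 0.8 reflects the one hard failure. The fixed-timeout metric is therefore one point on the curve that the distributional approach keeps whole.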

Core claim

The macro-averaged Kolmogorov-Smirnov test applied to per-object time-to-success CDFs distinguishes two of the three closest VLA pairs (GR00T vs. ACT, OpenPI vs. ACT) at N≤30 rollouts per (model, object) cell, whereas binary success-rate thresholds at fixed timeout do not; the same data yield an RMST ratio showing the best evaluated VLA is approximately 7 times slower than human teleoperation on the same fixtures.
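
The summary above does not spell out how the paper computes its RMST ratio, so the following is only a sketch using the textbook definition from survival analysis: restricted mean "survival" time up to a horizon τ is the integral of 1 − F(t) from 0 to τ, which for uncensored data equals the mean of min(T, τ). The horizon `tau`, the function name `rmst`, and all sample times are hypothetical.

```python
import math

def rmst(times, tau):
    """Restricted mean time-to-success up to horizon `tau`.

    Equals the area above the time-to-success CDF on [0, tau]. Hard failures
    (math.inf, an assumed encoding) contribute the full horizon.
    """
    return sum(min(t, tau) for t in times) / len(times)

# Hypothetical data on one fixture; not values from the paper.
human = [8.0, 9.0, 10.0, 11.0, 12.0]
model = [30.0, 40.0, 50.0, 60.0, math.inf]
tau = 120.0

ratio = rmst(model, tau) / rmst(human, tau)
print(ratio)  # 6.0
```

A ratio above 1 means the model spends more restricted time per operation than the human reference. Note that the value depends on the chosen horizon whenever hard failures are present, since each failure is charged exactly `tau`.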

What carries the argument

The time-to-success cumulative distribution function, used in two separate roles: scoring via Human-Relative Throughput (a dimensionless scalar with bootstrap confidence intervals) and significance testing via Kolmogorov-Smirnov tests computed per object and then macro-averaged.
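
One way to picture the significance-testing role is sketched below. The summary does not say how the paper turns per-object KS results into a single verdict, so this sketch makes its own choices: it averages the per-object KS statistic and calibrates it with a within-object permutation test. The function names, object labels, and data are all hypothetical, and the permutation step is a swapped-in technique rather than the paper's confirmed procedure.

```python
import math
import random

def ks_statistic(a, b):
    """Largest gap between two empirical time-to-success CDFs."""
    # The supremum is attained at a finite jump point; failures (math.inf)
    # only show up as a gap between the two plateaus.
    points = [t for t in list(a) + list(b) if math.isfinite(t)]
    best = 0.0
    for p in points:
        fa = sum(t <= p for t in a) / len(a)
        fb = sum(t <= p for t in b) / len(b)
        best = max(best, abs(fa - fb))
    return best

def macro_ks(model_a, model_b):
    """Mean of the per-object KS statistics (each object weighted equally)."""
    return sum(ks_statistic(model_a[o], model_b[o]) for o in model_a) / len(model_a)

def macro_ks_permutation_test(model_a, model_b, n_perm=300, seed=0):
    """Permutation p-value: shuffle model labels within each object."""
    rng = random.Random(seed)
    observed = macro_ks(model_a, model_b)
    hits = 0
    for _ in range(n_perm):
        perm_a, perm_b = {}, {}
        for obj in model_a:
            pooled = list(model_a[obj]) + list(model_b[obj])
            rng.shuffle(pooled)
            n_a = len(model_a[obj])
            perm_a[obj], perm_b[obj] = pooled[:n_a], pooled[n_a:]
        if macro_ks(perm_a, perm_b) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical rollouts: 20 per (model, object) cell, two objects.
fast = {"object_1": [10.0 + i for i in range(20)],
        "object_2": [15.0 + i for i in range(20)]}
slow = {"object_1": [40.0 + i for i in range(18)] + [math.inf, math.inf],
        "object_2": [50.0 + i for i in range(18)] + [math.inf, math.inf]}

observed, p_value = macro_ks_permutation_test(fast, slow)
print(observed, p_value)
```

Shuffling labels within each object, never across objects, keeps object difficulty from leaking into the null distribution, which is the same motivation the page gives for testing per object before averaging.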

If this is right

  • VLA comparisons become statistically resolvable with cohorts no larger than 30 rollouts per condition.
  • Model rankings can be reported with explicit confidence intervals rather than point estimates of success rate.
  • Human teleoperation on identical fixtures supplies a stable, dimensionless reference scale for robot throughput.
  • Per-object testing followed by macro-averaging prevents any single object from dominating the overall significance verdict.
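
The confidence-interval point can be illustrated with a plain percentile bootstrap. This is a sketch under stated assumptions: the scored scalar is a stand-in (a ratio of horizon-capped mean times, since the definition of Human-Relative Throughput is not given here), rollouts are resampled independently rather than with the episode clustering mentioned in the Figure 6 caption, and every name and number is hypothetical.

```python
import math
import random

def capped_mean(times, tau):
    """Mean time-to-success with each rollout capped at the horizon `tau`."""
    return sum(min(t, tau) for t in times) / len(times)

def bootstrap_ratio_ci(model, human, tau, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for model/human ratio."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        # Resample rollouts with replacement, separately for each cohort.
        m = rng.choices(model, k=len(model))
        h = rng.choices(human, k=len(human))
        ratios.append(capped_mean(m, tau) / capped_mean(h, tau))
    ratios.sort()
    # Approximate percentile positions in the sorted bootstrap sample.
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    point = capped_mean(model, tau) / capped_mean(human, tau)
    return point, lo, hi

# Hypothetical cohorts; math.inf marks a hard failure (assumed encoding).
human = [8.0, 9.0, 10.0, 11.0, 12.0] * 4
model = [30.0, 40.0, 50.0, 60.0, math.inf] * 4

point, lo, hi = bootstrap_ratio_ci(model, human, tau=120.0)
print(f"ratio = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate is what lets two models be called indistinguishable rather than ranked on noise. A clustered resampling scheme would typically widen the interval when rollouts within an episode are correlated.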

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption would encourage reporting full rollout traces rather than thresholded aggregates, enabling later re-analysis with different cutoffs.
  • The same CDF-plus-KS pipeline could be applied to other sequential robotics tasks where latency distributions matter more than binary completion.
  • The unresolved OpenPI-GR00T pair indicates that some model distinctions may require either larger N or a different test statistic even under the distributional approach.

Load-bearing premise

The cumulative distribution of time-to-success supplies a more discriminating and comparable evaluation primitive than binary success at a fixed timeout.

What would settle it

Re-evaluating the same four VLAs at N=30 with the identical fixtures and finding that the macro-averaged KS test no longer yields p-values below 0.05 for the pairs it currently separates would falsify the resolution advantage.
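
The re-evaluation described above is close in spirit to the detection-rate curves in Figure 3: subsample N rollouts per cell many times and count how often the test rejects at the 0.05 level. The sketch below does this for a single object with a two-sample KS test and a standard asymptotic p-value approximation. The pools are synthetic, times are assumed finite, and the function names are hypothetical; it is not the paper's protocol.

```python
import math
import random

def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (finite times assumed)."""
    best = 0.0
    for p in list(a) + list(b):
        fa = sum(t <= p for t in a) / len(a)
        fb = sum(t <= p for t in b) / len(b)
        best = max(best, abs(fa - fb))
    return best

def ks_pvalue_asymptotic(d, n, m):
    """Asymptotic two-sample KS p-value via the Kolmogorov series.

    Uses the common small-sample correction to the effective sample size;
    it is an approximation, not an exact p-value.
    """
    if d <= 0.0:
        return 1.0
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def detection_rate(pool_a, pool_b, n, trials=200, alpha=0.05, seed=0):
    """Fraction of random size-n subsamples in which the KS test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = rng.sample(pool_a, n)  # subsample without replacement
        b = rng.sample(pool_b, n)
        if ks_pvalue_asymptotic(ks_statistic(a, b), n, n) < alpha:
            rejections += 1
    return rejections / trials

# Synthetic pools of 100 rollouts each, with a modest shift between models.
gen = random.Random(1)
pool_a = [gen.uniform(20.0, 60.0) for _ in range(100)]
pool_b = [gen.uniform(30.0, 70.0) for _ in range(100)]

rate_10 = detection_rate(pool_a, pool_b, n=10)
rate_30 = detection_rate(pool_a, pool_b, n=30)
print(rate_10, rate_30)
```

Under this framing, the claim would be weakened if the detection rate for the currently separated pairs stayed well below the 0.8 power target on fresh rollouts collected on the same fixtures.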

Figures

Figures reproduced from arXiv: 2605.29710 by Sergey Arkhangelskiy.

Figure 1
(a) Time-to-success CDFs are richer than any scalar: reliability and throughput on a single axis. The four VLAs all sit far below the human reference – the best is ∼7× slower. Hard failures become the asymptote below F = 1. (b) Choosing the right test, not just running more trials, is what resolves close comparisons: macro-averaged Kolmogorov–Smirnov across per-object CDFs reaches 80% detection on GR00T vs…
Figure 2
One (model, object) cell (ACT on Batteries) illustrating the scoring layer: time-to-success…
Figure 3
Detection rate vs. subsample size N per (model, object) cell, on the three closest model pairs (left: OpenPI vs. GR00T, the closest; centre, right: the next two). Dashed line: the 0.8 power target. KS (red) climbs steeply on every pair while F(30 s), F(60 s), and RMST stay near the floor. KS reaches 80% within budget on the centre and right pairs, while OpenPI vs. GR00T remains unresolved at N=30 (KS at 0.…
Figure 4
PhAIL platform during a rollout, visualized in the run-explorer interface (built on Rerun [ …
Figure 5
Three step-function CDFs F_A, F_B, F_C (each with a T = ∞ asymptote representing a hard-failure rate). Arrows mark each pairwise supremum location with the magnitude and the winner at the sup. The pairwise sup-signs cycle A ≻ B ≻ C ≻ A: KS-sign is not a valid ranker. The KS statistic returns a magnitude and a sign at the supremum point. One might try to use that sign as a pairwise ranker: A ≻ B iff F_A(t*) >…
Figure 6
Per-(model, object) RMST with 95% episode-clustered bootstrap CIs. Crossings exist (most…
Figure 7
Pairwise model-vs-model P-P plots, top-3 only (SmolVLA excluded as too easily separable…
Figure 8
Per-object Q-Q plots, T_Human(q) vs T_model(q), all four VLAs on the four bin-to-bin objects. Dashed diagonal = as fast as the human reference; below-diagonal = model is slower. Open circle on each model curve marks q = 1 − p_fail, the highest quantile the model reaches before its hard-failure asymptote. Curves bending below the diagonal as q → 1 indicate tail-heavy slowdown rather than a uniform multiplicati…
Figure 9
UPH versus MTBF/A trajectories per object, parametric in…
Figure 10
Macro-averaged UPH versus MTBF/A trajectories: for each model and each…
Original abstract

Real-world evaluation of vision-language-action (VLA) policies still rests on binary success rate at a fixed timeout with $N \le 25$ rollouts per condition, almost always without confidence intervals or paired statistical comparison; these cohort sizes struggle to resolve close comparisons reliably. We introduce PhAIL (Physical AI Leaderboard, https://phail.ai), an open real-robot benchmark on a Franka FR3 (dataset, per-rollout artifacts, and end-to-end reference implementation) of a distributional evaluation methodology: the time-to-success cumulative distribution function (CDF) as the evaluation primitive, with two separated jobs. The first is scoring via Human-Relative Throughput (HRT), a dimensionless scalar with bootstrap confidence intervals, anchored to same-fixture human teleoperation. The second is a significance test (Kolmogorov-Smirnov, computed per-object and macro-averaged across objects). On four publicly-available VLAs, the macro-averaged KS test resolves two close comparisons (GR00T vs. ACT, OpenPI vs. ACT) at $N \le 30$ rollouts per (model, object) cell where binary-threshold metrics do not; the closest pair (OpenPI vs. GR00T) remains unresolved within our budget. The best evaluated VLA is $\sim 7\times$ slower per operation (RMST ratio) than the human reference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces PhAIL, an open real-robot benchmark for vision-language-action (VLA) policies on a Franka FR3 arm. It critiques the standard practice of using binary success rates at fixed timeouts with small sample sizes (N ≤ 25) without statistical comparisons. The proposed methodology uses the time-to-success cumulative distribution function (CDF) as the evaluation primitive, introducing Human-Relative Throughput (HRT) as a dimensionless scalar metric with bootstrap confidence intervals anchored to human teleoperation performance, and employs per-object Kolmogorov-Smirnov (KS) tests that are macro-averaged across objects for significance testing. On four public VLAs, the macro-averaged KS test is shown to resolve two close model comparisons (GR00T vs. ACT and OpenPI vs. ACT) at N ≤ 30 rollouts per (model, object) cell, where binary-threshold metrics fail to do so, while the closest pair (OpenPI vs. GR00T) remains unresolved; additionally, the best VLA is reported to be approximately 7 times slower than the human reference in terms of RMST ratio.

Significance. If the results hold, this work provides a valuable contribution to real-robot VLA evaluation by offering a distributional approach that can distinguish model performances with modest rollout numbers where traditional metrics cannot. The open release of the benchmark, dataset, per-rollout artifacts, and end-to-end reference implementation is a notable strength, promoting reproducibility and allowing the community to build on the distributional methodology. The empirical demonstration that macro-averaged KS tests on time-to-success CDFs can resolve two pairwise comparisons at N ≤ 30 where binary success rates cannot is a concrete, falsifiable finding that advances benchmarking practices in robotics.

minor comments (3)
  1. The abstract references 'per (model, object) cell' and macro-averaging across objects but does not specify the number of objects or the exact task definitions used for the KS tests; adding this detail in the methods would strengthen verifiability of the reported resolutions.
  2. The RMST ratio used to quantify the ~7× slowdown relative to human teleoperation is mentioned without an explicit definition or reference to its computation; a brief equation or section reference would clarify this metric.
  3. Details on data collection protocols, success criteria, and any exclusion rules for rollouts are referenced as necessary for the KS and HRT computations but are not elaborated in the provided abstract; expanding these in the full methods section would address reproducibility concerns.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the accurate summary of the PhAIL benchmark and distributional methodology, and the recommendation for minor revision. We appreciate the recognition of the open release of the benchmark, dataset, and reference implementation, as well as the concrete empirical demonstration regarding macro-averaged KS tests.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces an empirical benchmark (PhAIL) and distributional methodology using the time-to-success CDF as primitive, with HRT anchored to external human teleoperation data and standard Kolmogorov-Smirnov tests applied per-object then macro-averaged. No load-bearing steps reduce by the paper's equations or self-citation to the paper's own inputs; the central claims consist of direct empirical comparisons on four public VLAs at N≤30, with no self-definitional metrics, fitted parameters renamed as predictions, or uniqueness theorems imported from prior author work. The methodology treats CDF/HRT/KS as alternative evaluation primitives rather than deriving them from the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The evaluation framework rests on the assumption that time-to-success distributions are the right primitive and that human teleoperation provides a stable anchor; no free parameters are described, and the one invented entity listed below is a metric rather than a physical object.

axioms (1)
  • standard math · Kolmogorov-Smirnov test appropriately compares empirical CDFs of time-to-success across models and objects
    Invoked for per-object and macro-averaged significance testing.
invented entities (1)
  • Human-Relative Throughput (HRT) · no independent evidence
    purpose: Dimensionless scalar scoring VLA performance relative to human teleoperation with bootstrap CIs
    New metric introduced as the scoring primitive.



Reference graph

Works this paper leans on

40 extracted references · 18 canonical work pages · 13 internal anchors

  1. [1]

    Nonparametric Estimation from Incomplete Observations

    Kaplan, E.L., Meier, P. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association, 53(282):457–481, 1958

  2. [2]

    Evaluation of Survival Data and Two New Rank Order Statistics Arising in its Consideration

    Mantel, N. Evaluation of Survival Data and Two New Rank Order Statistics Arising in its Consideration. Cancer Chemotherapy Reports, 50(3):163–170, 1966

  3. [3]

    Table for Estimating the Goodness of Fit of Empirical Distributions

    Smirnov, N.V. Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics, 19(2):279–281, 1948

  4. [4]

    An Introduction to the Bootstrap

    Efron, B., Tibshirani, R.J. An Introduction to the Bootstrap. Chapman & Hall, 1993

  5. [5]

    Testing Statistical Hypotheses, 3rd ed

    Lehmann, E.L., Romano, J.P. Testing Statistical Hypotheses, 3rd ed. Springer, 2005

  6. [6]

    Probable Inference, the Law of Succession, and Statistical Inference

    Wilson, E.B. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158):209–212, 1927

  7. [7]

    Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages

    McNemar, Q. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12(2):153–157, 1947

  8. [8]

    Sample Size for Testing Differences in Proportions for the Paired-Sample Design

    Connor, R.J. Sample Size for Testing Differences in Proportions for the Paired-Sample Design. Biometrics, 43(1):207–211, 1987

  9. [9]

    Visualization SDK for Multimodal Data

    Rerun.io. Visualization SDK for Multimodal Data. https://rerun.io, 2024

  10. [10]

    positronic: Open-source framework for real-robot evaluation and operation

    Positronic Robotics. positronic: Open-source framework for real-robot evaluation and operation. https://github.com/Positronic-Robotics/positronic, 2025

  11. [11]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. RSS, 2024

  12. [12]

    RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies

    Tang et al. RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies. arXiv:2510.17950, 2025

  13. [13]

    RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

    Atreya et al. RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies. arXiv:2506.18123, 2025

  14. [14]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Nasiriany et al. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. arXiv, 2024

  15. [15]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation

    Mees et al. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation. IEEE RA-L, 2022

  16. [16]

    ManiSkill3: GPU Parallelized Robotics Simulation and Benchmarking at Scale

    Gu et al. ManiSkill3: GPU Parallelized Robotics Simulation and Benchmarking at Scale. arXiv:2410.00425, 2024

  17. [17]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Li et al. Evaluating Real-World Robot Manipulation Policies in Simulation. arXiv, 2024

  18. [18]

    Research Challenges and Progress in Robotic Grasping and Manipulation Competitions

    Sun, Falco, Roa, Calli. Research Challenges and Progress in Robotic Grasping and Manipulation Competitions. IEEE Robotics and Automation Letters, 2022

  19. [19]

    OCRTOC: A Cloud-Based Competition and Benchmark for Robotic Grasping and Manipulation

    Liu et al. OCRTOC: A Cloud-Based Competition and Benchmark for Robotic Grasping and Manipulation. IEEE Robotics and Automation Letters, 2021

  20. [20]

    NIST Assembly Task Boards: Performance Metrics and Test Methods for Robotic Assembly

    Falco et al. NIST Assembly Task Boards: Performance Metrics and Test Methods for Robotic Assembly. NIST IR / IEEE, ongoing

  21. [21]

    A Robust Real Robot Baseline for the Real Robot Challenge

    Bauer et al. A Robust Real Robot Baseline for the Real Robot Challenge. NeurIPS Datasets and Benchmarks, 2022

  22. [22]

    Digital Robot Judge: Building a Task-centric Performance Database of Real-World Manipulation With Electronic Task Boards

    So, Sarabakha, Wu, Culha, Abu-Dakka, Haddadin. Digital Robot Judge: Building a Task-centric Performance Database of Real-World Manipulation With Electronic Task Boards. IEEE Robotics & Automation Magazine, 2023

  23. [23]

    Robot Learning as an Empirical Science: Best Practices for Policy Evaluation

    Kress-Gazit, Hashimoto, Kuppuswamy, Shah, Horgan, Richardson, Feng, Burchfiel. Robot Learning as an Empirical Science: Best Practices for Policy Evaluation. arXiv:2409.09491, 2024

  24. [24]

    A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

    TRI LBM Team, Barreiros et al. A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation. arXiv:2507.05331, 2025

  25. [25]

    Deep Reinforcement Learning at the Edge of the Statistical Precipice

    Agarwal, Schwarzer, Castro, Courville, Bellemare. Deep Reinforcement Learning at the Edge of the Statistical Precipice. NeurIPS, 2021

  26. [26]

    Deep Reinforcement Learning That Matters

    Henderson, Islam, Bachman, Pineau, Precup, Meger. Deep Reinforcement Learning That Matters. AAAI, 2018

  27. [27]

    Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping

    Snyder et al. Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping. arXiv:2503.10966, 2025

  28. [28]

    Human-level Control through Deep Reinforcement Learning

    Mnih et al. Human-level Control through Deep Reinforcement Learning. Nature, 2015

  29. [29]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Black et al. $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054, 2025

  30. [30]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734, 2025

  31. [31]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, Kumar, Levine, Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS, 2023 (arXiv:2304.13705)

  32. [32]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Shukor et al. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv:2506.01844, 2025

  33. [33]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan et al. RT-1: Robotics Transformer for Real-World Control at Scale. RSS, 2023 (arXiv:2212.06817)

  34. [34]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL, 2023 (arXiv:2307.15818)

  35. [35]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA, 2024 (arXiv:2310.08864)

  36. [36]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim et al. OpenVLA: An Open-Source Vision-Language-Action Model. CoRL, 2024 (arXiv:2406.09246)

  37. [37]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164, 2024

  38. [38]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Black et al. $\pi^{*}_{0.6}$: a VLA That Learns From Experience. arXiv:2511.14759, 2025

  39. [39]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Fu et al. Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. CoRL, 2024 (arXiv:2401.02117)

  40. [40]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Zawalski et al. Robotic Control via Embodied Chain-of-Thought Reasoning. CoRL, 2024 (arXiv:2407.08693)

A Survey of Recent VLA Evaluation Practice

Table 5 surveys 13 recent real-robot VLA papers from 2023–2025; the LBM examination [24] is included below the rule as the single recent counter-example. Modal per-condition N is 10–20; none of the 13 standard...