VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models

Emiliyan Gospodinov; Enes Ulas Dincer; Florian Seligmann; Gerhard Neumann

arxiv: 2606.21386 · v1 · pith:3GHBGWNWnew · submitted 2026-06-19 · 💻 cs.LG · cs.CV

Pith reviewed 2026-06-26 14:55 UTC · model grok-4.3

keywords: failure detection · vision-language-action models · out-of-distribution detection · action consistency · Mahalanobis distance · robotics · receding horizon control

The pith

VLA-FAIL combines last-layer feature deviations with action chunk consistency to detect failures in finetuned vision-language-action models without any failure examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VLA-FAIL as a lightweight framework for spotting when vision-language-action models start to fail at robotic tasks during deployment. It pairs one detector that flags out-of-distribution states by measuring how far last-layer token features stray from training patterns with a second detector that flags failures when consecutive action plans become inconsistent under receding-horizon control. Both checks run with low overhead and require no access to failure demonstrations. A new threshold-free metric tracks the balance between detection accuracy and how quickly failures are caught. Experiments across real robots and simulations show the two signals catch complementary failure types and often outperform more costly baselines.

Core claim

VLA-FAIL detects task failures by running last-layer Mahalanobis distance on token features to identify distribution shifts and action chunk consistency on temporally overlapping plans to identify planning breakdowns. Their combination yields reliable early detection across tasks while adding minimal compute and avoiding any need for failure rollouts or repeated action sampling.
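
As a rough illustration of the LLMD idea, the sketch below fits a single Gaussian to last-layer token features and scores new tokens by their squared Mahalanobis distance, d²(x) = (x − μ)ᵀ Σ⁻¹ (x − μ). It is a minimal NumPy sketch and not the authors' implementation: the function names, the ridge term `eps`, and the mean-over-tokens aggregation are assumptions made here, and random vectors stand in for real VLA features.

```python
import numpy as np


def fit_gaussian(train_feats, eps=1e-6):
    """Fit a mean and inverse covariance to (N, D) training token features."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    # Small ridge so the covariance stays invertible (assumption of this sketch).
    cov = cov + eps * np.eye(cov.shape[0])
    return mu, np.linalg.inv(cov)


def mahalanobis_per_token(feats, mu, cov_inv):
    """Return one squared Mahalanobis distance per row of a (T, D) array."""
    diff = feats - mu
    return np.einsum("td,de,te->t", diff, cov_inv, diff)


def llmd_score(feats, mu, cov_inv):
    """Aggregate token-wise distances; the mean is a hypothetical choice."""
    return float(mahalanobis_per_token(feats, mu, cov_inv).mean())


# Synthetic stand-ins for last-layer features (not real VLA activations).
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 8))
mu, cov_inv = fit_gaussian(train)

in_dist = rng.normal(size=(16, 8))
shifted = in_dist + 5.0  # crude out-of-distribution shift

in_score = llmd_score(in_dist, mu, cov_inv)
ood_score = llmd_score(shifted, mu, cov_inv)
```

On this toy data the shifted tokens score far higher than the in-distribution ones, which is the kind of signal a detector would threshold.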

What carries the argument

The joint use of last-layer Mahalanobis distance on token features and consistency checks between consecutive action chunks under receding-horizon control.
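
The consistency half of that pairing can be sketched in a few lines. Under receding-horizon control only the first few actions of each chunk are executed, so the unexecuted tail of one chunk covers the same time steps as the head of the next. The pure-Python sketch below compares the two with a mean L2 distance. The name `acc_score`, the `executed` parameter, and the choice of L2 are assumptions for illustration, since this page does not give the paper's exact formula.

```python
import math


def acc_score(prev_chunk, next_chunk, executed):
    """Mean L2 distance over the overlap of two consecutive action chunks.

    prev_chunk, next_chunk: lists of action vectors (lists of floats).
    executed: number of actions of prev_chunk run before replanning.
    """
    leftover = prev_chunk[executed:]  # predicted but not executed
    overlap = min(len(leftover), len(next_chunk))
    if overlap == 0:
        raise ValueError("chunks do not overlap")
    dists = [math.dist(leftover[i], next_chunk[i]) for i in range(overlap)]
    return sum(dists) / overlap


# Toy 2-D actions: the policy replans after executing 2 of 4 actions.
prev = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]
consistent = [[0.2, 0.0], [0.3, 0.0], [0.4, 0.0], [0.5, 0.0]]
inconsistent = [[0.2, 0.3], [0.3, 0.4], [0.4, 0.5], [0.5, 0.6]]

low = acc_score(prev, consistent, executed=2)
high = acc_score(prev, inconsistent, executed=2)
```

Here `low` is zero because the new chunk continues the old plan exactly, while `high` reflects the disagreement on the two overlapping steps.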

If this is right

Failure detection becomes feasible at runtime for any finetuned VLA without collecting negative examples.
Detection latency can be traded against precision using the AUCPDT metric without needing to tune thresholds in advance.
The method applies across manipulation tasks that use receding-horizon action chunking.
Compute cost stays low enough for on-robot deployment compared with sampling-based alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two signals could be tested on other sequence predictors that generate overlapping output chunks.
Calibration of the Mahalanobis covariance on a small held-out set might further reduce false positives without retraining the VLA.
Integrating the consistency check directly into the policy's planning loop could allow the model to replan automatically on detection.

Load-bearing premise

Token-wise deviations from training features and inconsistencies between action chunks are sufficient indicators of actual task failure in out-of-distribution states.

What would settle it

A recorded rollout in which the VLA visibly fails at its assigned task yet both last-layer Mahalanobis distance and action chunk consistency remain below their detection thresholds throughout.
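
Stated as a check, that falsifying case looks like the sketch below. It assumes per-step score lists and fixed thresholds are available, and it treats the combined detector as firing when either score exceeds its threshold. That OR rule and all names are assumptions for illustration, not the paper's stated combination.

```python
def is_counterexample(task_failed, llmd_scores, acc_scores, llmd_thr, acc_thr):
    """True if the rollout failed while neither detector ever fired."""
    llmd_fired = any(s > llmd_thr for s in llmd_scores)
    acc_fired = any(s > acc_thr for s in acc_scores)
    return task_failed and not (llmd_fired or acc_fired)


# A failed rollout whose scores never cross either (hypothetical) threshold.
silent_failure = is_counterexample(True, [1.0, 2.0], [0.1, 0.2], 3.0, 0.5)
# A failed rollout where LLMD crosses its threshold at the second step.
caught_failure = is_counterexample(True, [1.0, 4.0], [0.1, 0.2], 3.0, 0.5)
```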

Figures

Figures reproduced from arXiv: 2606.21386 by Emiliyan Gospodinov, Enes Ulas Dincer, Florian Seligmann, Gerhard Neumann.

**Figure 1.** (a) Our task failure detection pipeline: We combine LLMD, which detects unlikely last-layer features under the feature distribution of the training data, with ACC, which measures the consistency between overlapping parts of successive action chunks. (b) Action chunk sample time for various failure detectors (X-VLA [11], RTX 5090, constant 17 ms of VLM excluded). Baselines are not real-time capable due to t…

**Figure 2.** Predicted end-effector (EEF) x position over episode time for a successful and a failed episode of π0.5 on Blocks. In successful episodes, the predicted but not executed actions of an action chunk typically overlap with the next action chunk. In contrast, the failed episode shows substantial misalignment between consecutive action chunks. ACC measures this inconsistency.

**Figure 3.** X-VLA on Blocks. Here, STAC demonstrates no useful failure signal, leading to a recall that stays constant at the fail rate, and a PDT that is constant at 1. Panels: (a) Precision-Recall (precision vs. recall); (b) Penalized Detection Time (penalized detection time vs. precision). Methods shown: ACE, Diff, STAC, ACC, LLMD, FAIL.

**Figure 4.** π0.5 on Blocks. ACC performs substantially worse than other methods.

**Figure 5.** Dependence of ACC on the number of overlapping actions on the …

**Figure 6.** Dependence of ACC on the number of overlapping actions on the …

**Figure 7.** Qualitative failure detection over time. Each task is shown with two detector rows: ACC …

**Figure 8.** Qualitative failure detection over time. Each task is shown with two detector rows: ACC …

**Figure 9.** Left: Real-world setup. Right: RGB views from left, right, and gripper cameras. (a) Blocks (b) Drawer (c) Cups (d) Kitchen (e) Stack T (f) Mixer

**Figure 10.** Initial and final states for the real-world tasks. For each task, the left image shows the …

**Figure 11.** LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. Re …

Original abstract

Vision-language-action models (VLAs) achieve state-of-the-art performance on many robotic manipulation tasks, yet they can still behave unpredictably in out-of-distribution scenarios. Runtime failure detection is therefore essential for the safe real-world deployment of VLAs. However, existing task failure detectors require computationally expensive action sampling, are based on architectural assumptions that limit their applicability to VLAs, or need access to failure rollouts. We propose VLA-FAIL, a lightweight and broadly applicable failure detection framework for VLAs that combines two novel failure detectors with minimal overhead, without requiring failure data. The first, last-layer Mahalanobis distance (LLMD), detects out-of-distribution states by measuring token-wise deviations in last-layer features relative to the training data. The second, action chunk consistency (ACC), exploits the temporal overlap induced by receding-horizon control and detects failures when consecutive action chunks become inconsistent. To capture the trade-off between detection accuracy and detection latency, we introduce AUCPDT, a threshold-independent metric that jointly evaluates precision, recall, and detection time. Through extensive real-world and simulation experiments, we demonstrate that LLMD and ACC capture complementary failure modes whose combination enables reliable and early failure detection across diverse tasks, frequently outperforming significantly more expensive baseline methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VLA-FAIL gives two lightweight detectors for VLA failures without failure data, but their ability to catch every relevant OOD case still needs checking.

The key takeaway is that this paper gives a practical, low-cost way to detect task failures in finetuned vision-language-action models using two detectors that do not require any failure data.

It introduces last-layer Mahalanobis distance to flag out-of-distribution states from feature deviations, and action chunk consistency to spot when consecutive action chunks, which overlap in time under receding-horizon control, disagree. The AUCPDT metric is a useful addition for evaluating detection without fixing a threshold. The work shows that the two detectors capture different failure types and often outperform heavier baselines in the real-world and simulation tests.

One soft spot is the risk that some failures slip through if they do not produce shifts in the last-layer features or inconsistencies in the action chunks. The Mahalanobis detector relies on the training features being well modeled by a Gaussian, which may not hold in high dimensions or with certain finetuning methods. The paper would benefit from more discussion or tests on edge cases like external disturbances that happen after correct actions.

This is for robotics researchers focused on making VLAs reliable in the real world. Readers interested in safety for learned controllers will find the concrete detectors and the evaluation approach helpful.

I would send it for peer review. The contribution is targeted and builds directly on VLA properties, so it merits referee attention even if some assumptions need tightening.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes VLA-FAIL, a lightweight runtime failure detection framework for finetuned vision-language-action (VLA) models. It introduces two detectors that require no failure rollouts: last-layer Mahalanobis distance (LLMD), which measures token-wise deviations of last-layer features from a Gaussian fitted on training data, and action chunk consistency (ACC), which flags failures via inconsistencies between overlapping action chunks produced by receding-horizon control. The combination is claimed to capture complementary failure modes. The paper also defines AUCPDT, a threshold-independent metric that integrates precision, recall, and detection latency, and reports that the method outperforms more expensive baselines across diverse simulation and real-world tasks.

Significance. If the central experimental claims hold, the work supplies a practical, low-overhead safety layer for VLAs that avoids the need for failure data collection, a notable practical advantage. The AUCPDT metric is a useful contribution for comparing detectors on the accuracy-latency trade-off. The explicit complementarity argument between feature-based and temporal-consistency detectors, if substantiated, would be a clear advance over single-heuristic approaches.

major comments (3)

[§3.1] §3.1 (LLMD definition): The detector assumes last-layer activations of successful trajectories are adequately modeled by a single multivariate Gaussian whose covariance can be reliably estimated from training tokens. No validation of this modeling choice (e.g., QQ-plots, covariance conditioning, or comparison to kernel density or mixture models) is provided, yet the central claim that LLMD reliably detects OOD states rests on it. High feature dimensionality or LoRA-style finetuning that leaves the final projection largely unchanged could invalidate the assumption.
[§3.2] §3.2 (ACC definition and §4 experiments): ACC relies on measurable inconsistency between consecutive overlapping action chunks indicating failure. The manuscript supplies no analysis or ablation on policy stochasticity, chunk length, or the choice of inconsistency metric (L2 vs. cosine). These parameters directly affect whether ACC fires on the claimed failure modes; without such analysis the claim that ACC and LLMD are jointly sufficient for reliable detection across arbitrary OOD states remains unsupported.
[§4] §4 (experimental results and Table 2/3): The abstract and results describe the LLMD+ACC combination as “frequently outperforming significantly more expensive baseline methods” across diverse tasks. However, the paper does not demonstrate that the chosen baselines also operate without failure rollouts or that the reported AUCPDT gains are robust to post-hoc threshold selection. This comparison is load-bearing for the superiority claim.
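
The covariance-conditioning diagnostic requested in the first comment could be produced along the lines of the sketch below. It reports the samples-per-dimension ratio and the condition number of the empirical covariance, before and after shrinking it toward a scaled identity. The function name, the fixed `shrinkage` weight, and the synthetic features are assumptions for illustration. The paper may handle conditioning differently, or not at all.

```python
import numpy as np


def covariance_report(feats, shrinkage=0.1):
    """Conditioning diagnostics for an (N, D) feature matrix."""
    n, d = feats.shape
    cov = np.cov(feats, rowvar=False)
    # Shrink toward a scaled identity with the same average variance.
    target = (np.trace(cov) / d) * np.eye(d)
    shrunk = (1.0 - shrinkage) * cov + shrinkage * target
    return {
        "samples_per_dim": n / d,
        "cond_raw": float(np.linalg.cond(cov)),
        "cond_shrunk": float(np.linalg.cond(shrunk)),
    }


# Synthetic features with one nearly duplicated column (ill-conditioned).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 4))
near_copy = base[:, :1] + 1e-4 * rng.normal(size=(200, 1))
feats = np.hstack([base, near_copy])

report = covariance_report(feats)
```

A very large `cond_raw` signals that the inverse covariance, and therefore the Mahalanobis distance, is dominated by a few poorly estimated directions.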

minor comments (3)

[§3.1] Notation for the Mahalanobis distance in §3.1 is introduced without an explicit equation number; adding Eq. (X) would improve traceability.
[Figures] Figure captions for the real-world experiment plots should explicitly state the number of trials per task and whether error bars reflect standard deviation or standard error.
[Related Work] The related-work section omits recent action-chunking papers that also exploit receding-horizon overlap; adding 2–3 citations would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [§3.1] §3.1 (LLMD definition): The detector assumes last-layer activations of successful trajectories are adequately modeled by a single multivariate Gaussian whose covariance can be reliably estimated from training tokens. No validation of this modeling choice (e.g., QQ-plots, covariance conditioning, or comparison to kernel density or mixture models) is provided, yet the central claim that LLMD reliably detects OOD states rests on it. High feature dimensionality or LoRA-style finetuning that leaves the final projection largely unchanged could invalidate the assumption.

Authors: We agree that additional validation of the Gaussian assumption would improve the manuscript. The single-Gaussian Mahalanobis distance is a standard choice in the OOD detection literature, but we will add covariance conditioning diagnostics and representative QQ-plots for the last-layer token features in the revised version. We will also include a short discussion noting that LLMD is applied token-wise (mitigating some dimensionality concerns) and that empirical results hold across both full fine-tuning and LoRA-based VLAs; however, we acknowledge that cases where the final projection remains unchanged could reduce sensitivity and will flag this as a limitation. revision: yes
Referee: [§3.2] §3.2 (ACC definition and §4 experiments): ACC relies on measurable inconsistency between consecutive overlapping action chunks indicating failure. The manuscript supplies no analysis or ablation on policy stochasticity, chunk length, or the choice of inconsistency metric (L2 vs. cosine). These parameters directly affect whether ACC fires on the claimed failure modes; without such analysis the claim that ACC and LLMD are jointly sufficient for reliable detection across arbitrary OOD states remains unsupported.

Authors: We accept that the current manuscript lacks ablations on these design choices. In the revision we will add experiments that vary chunk length, compare L2 versus cosine inconsistency, and report how detection performance changes. All evaluated policies use deterministic decoding at inference time; we will state this explicitly and briefly discuss expected behavior under temperature sampling. These additions will better support the complementarity argument between LLMD and ACC. revision: yes
Referee: [§4] §4 (experimental results and Table 2/3): The abstract and results describe the LLMD+ACC combination as “frequently outperforming significantly more expensive baseline methods” across diverse tasks. However, the paper does not demonstrate that the chosen baselines also operate without failure rollouts or that the reported AUCPDT gains are robust to post-hoc threshold selection. This comparison is load-bearing for the superiority claim.

Authors: We will revise the text and tables to explicitly document the training-data requirements of each baseline, confirming that the main comparators (ensemble and sampling-based detectors) can be run without failure rollouts. Because AUCPDT integrates over all thresholds and incorporates latency, it is designed to be insensitive to any single threshold choice; we will highlight this property more clearly and add a short sensitivity check in the supplement. With these clarifications the reported gains remain valid under the stated experimental conditions. revision: partial
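
To make the threshold-independence point concrete, the sketch below sweeps every observed episode score as a threshold and records precision and recall at each one. This is only the generic sweep that threshold-free metrics build on. It is not the AUCPDT definition, which this page does not spell out, and it ignores detection time entirely. All names and the toy scores are assumptions for illustration.

```python
def precision_recall_sweep(episode_scores, failed):
    """Precision and recall at every observed score used as a threshold.

    An episode is flagged when its score is >= the threshold.
    """
    points = []
    for thr in sorted(set(episode_scores), reverse=True):
        flagged = [s >= thr for s in episode_scores]
        tp = sum(f and y for f, y in zip(flagged, failed))
        fp = sum(f and not y for f, y in zip(flagged, failed))
        fn = sum((not f) and y for f, y in zip(flagged, failed))
        precision = tp / (tp + fp)  # tp + fp >= 1: thr is an observed score
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((thr, precision, recall))
    return points


# Toy data: one detector score per episode and whether the task failed.
scores = [0.9, 0.8, 0.3, 0.2]
failed = [True, True, False, False]
points = precision_recall_sweep(scores, failed)
```

Integrating a curve built from such points, instead of reading it at one operating point, is what removes the dependence on a single threshold choice.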

Circularity Check

0 steps flagged

No circularity; detectors defined from standard statistical ideas and evaluated empirically

The paper defines LLMD as token-wise Mahalanobis distance on last-layer features fitted to training data (standard OOD technique) and ACC as inconsistency check on overlapping action chunks from receding-horizon control (standard temporal consistency idea). Neither reduces by construction to target data quantities or self-citations. No equations, uniqueness theorems, or ansatzes are smuggled in; performance claims rest on real-world and simulation experiments rather than derivations equivalent to inputs. The central claim of complementarity is an empirical observation, not a forced result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that last-layer features are informative for OOD detection and that action-chunk overlap provides a reliable consistency signal; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Last-layer token features of a finetuned VLA follow a distribution that can be summarized by mean and covariance for Mahalanobis distance computation.
Standard OOD assumption applied here to VLAs; invoked when defining LLMD.
domain assumption: Consecutive action chunks produced by receding-horizon control should be temporally consistent when the model is succeeding.
Core premise of the ACC detector; stated in the abstract description of the method.

pith-pipeline@v0.9.1-grok · 5770 in / 1322 out tokens · 25332 ms · 2026-06-26T14:55:39.357625+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 2 canonical work pages

[1] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control, 2026. URL https://arxiv. o...
[2] Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...
[3] NVIDIA, J. Bjorck, N. C. Fernando Castañeda, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A....
[4] J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917
[5] H. Fang, J. Duan, D. Clay, S. Wang, S. Liu, W. Huang, X. Fan, W.-C. Tsai, S. Chen, Y. R. Wang, S. Xing, J. Cho, J. S. Park, A. Eftekhar, P. Sushko, K. Farley, A. Wadhwa, C. Harrison, W. Han, Y.-C. Lee, E. VanderBilt, R. Hendrix, S. Ellawela, L. Ngoo, J. Chai, Z. Ren, A. Farhadi, D. Fox, and R. Krishna. Molmoact2: Action reasoning models for real-world d...
[6] M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
[7] Q. Gu, Y. Ju, S. Sun, I. Gilitschenski, H. Nishimura, M. Itkina, and F. Shkurti. SAFE: Multitask failure detection for vision-language-action models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=XPyAukgsFf
[8] K. Menda, K. Driggs-Campbell, and M. J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5041–5048. IEEE Press, 2019. doi:10.1109/IROS40897.2019.8968287. URL https://doi.org/10.1109/IROS40897.2019.8968287
[9] S.-W. Lee, X. Kang, and Y.-L. Kuo. Diff-dagger: Uncertainty estimation with diffusion policy for robotic manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4845–4852. IEEE, 2025
[10] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022
[11] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, T. Wang, Y.-Q. Zhang, J. Liu, and X. Zhan. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=kt51kZH4aG
[12] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025
[13] R. Römer, A. Kobras, L. Worbis, and A. P. Schoellig. Failure prediction at runtime for generative robot policies. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
[14] S. Zhou, B. Zhu, J. Yang, X. Zhao, J. Chen, and Y.-G. Jiang. Rc-nf: Robot-conditioned normalizing flow for real-time anomaly detection in robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 43050–43060, June 2026
[15] C. Ma, G. Yang, K. Lu, S. Xu, B. Byrne, N. Trigoni, and A. Markham. Cyclevla: Proactive self-correcting vision-language-action models via subtask backtracking and minimum bayes risk decoding. arXiv preprint arXiv:2601.02295, 2026
[16] Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen. Failsafe: Reasoning and recovery from failures in vision-language-action models, 2025. URL https://arxiv.org/abs/2510.01642
[17] C. Agia, R. Sinha, J. Yang, Z. Cao, R. Antonova, M. Pavone, and J. Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 689–723. PMLR, 2025
[18] C. Xu, T. K. Nguyen, E. Dixon, C. Rodriguez, P. Miller, R. Lee, P. Shah, R. A. Ambrus, H. Nishimura, and M. Itkina. Can we detect failures without failure data? uncertainty-aware runtime failure detection for imitation learning policies. In Robot Evaluation for the Real World, 2025. URL https://openreview.net/forum?id=A2iUXYdWZD
[19] Q. Rolland, F. Mayran de Chamisso, and J.-B. Mouret. Failure identification in imitation learning via statistical and semantic filtering. In IEEE International Conference on Robotics and Automation (ICRA), 2026.
[20] G. Zheng, S. Seenivasan, M. Johnson-Roberson, and W. Zhi. Rewind-il: Online failure detection and state respawning for imitation learning, 2026. URL https://arxiv.org/abs/2604.16683
[21] H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025
[22] A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Robotics: Science and Systems 2026, 2026
[23] T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. Roboreward: A dataset and benchmark for vision-language reward models in robotics, 2025. URL https://openreview.net/forum?id=iDmt7Tzmke
[24] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
[25] K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/abdeb6f575ac5c6676b747bca8d09cc2-Paper.pdf
[27] J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan. A simple fix to mahalanobis distance for improving near-ood detection, 2021. URL https://arxiv.org/abs/2106.09022
[28] M. Müller and M. Hein. Mahalanobis++: Improving OOD detection via feature normalization. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=vutMcZl50l
[29] E. Daxberger, A. Kristiadi, A. Immer, R. Eschenhagen, M. Bauer, and P. Hennig. Laplace redux - effortless bayesian deep learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20089–20103. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/...
[30] M. Reuss, H. Zhou, M. Rühle, Ö. E. Yağmurlu, F. Otto, and R. Lioutikov. FLOWER: Democratizing generalist robot policies with efficient vision-language-flow models. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=JeppaebLRD
[31] Q. Li, Y. Deng, Y. Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, Y. Zhang, X. Chen, H. Chen, L. Sun, D. Chen, J. Yang, and B. Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. arXiv preprint arXiv:2510.21571, 2025
[32] C. Xu, T. K. Nguyen, E. Dixon, C. Rodriguez, P. Miller, R. Lee, P. Shah, R. Ambrus, H. Nishimura, and M. Itkina. Can we detect failures without failure data? uncertainty-aware runtime failure detection for imitation learning policies. In Robotics: Science and Systems, 2025.
[33] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerv...
[34] B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024
[35] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. volume 36, pages 44776–44791, 2023
[36] Jülich Supercomputing Centre. JUWELS Cluster and Booster: Exascale Pathfinder with Modular Supercomputing Architecture at Juelich Supercomputing Centre. Journal of large-scale research facilities, 7(A183), 2021. doi:10.17815/jlsrf-7-183. URL http://dx.doi.org/10.17815/jlsrf-7-183.

[1] [1]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control, 2026. URLhttps://arxiv. o...

Pith/arXiv arXiv 2026

[2] [2]

Black, N

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...

2025

[3] [3]

Bjorck, N

NVIDIA, J. Bjorck, N. C. Fernando Castañeda, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, A....

2025

[4] [4]

J. Lee, J. Duan, H. Fang, Y . Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y . R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URLhttps: //arxiv.org/abs/2508.07917

Pith/arXiv arXiv 2025

[5] [5]

H. Fang, J. Duan, D. Clay, S. Wang, S. Liu, W. Huang, X. Fan, W.-C. Tsai, S. Chen, Y . R. Wang, S. Xing, J. Cho, J. S. Park, A. Eftekhar, P. Sushko, K. Farley, A. Wadhwa, C. Harrison, W. Han, Y .-C. Lee, E. VanderBilt, R. Hendrix, S. Ellawela, L. Ngoo, J. Chai, Z. Ren, A. Farhadi, D. Fox, and R. Krishna. Molmoact2: Action reasoning models for real-world d...

Pith/arXiv arXiv 2026

[6] [6]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[7] [7]

Q. Gu, Y . Ju, S. Sun, I. Gilitschenski, H. Nishimura, M. Itkina, and F. Shkurti. SAFE: Multitask failure detection for vision-language-action models. InThe Thirty-ninth Annual Conference 9 on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum? id=XPyAukgsFf

2025

[8] K. Menda, K. Driggs-Campbell, and M. J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), page 5041–5048. IEEE Press, 2019. doi:10.1109/IROS40897.2019.8968287. URL https://doi.org/10.1109/IROS40897.2019.8968287

[9] S.-W. Lee, X. Kang, and Y.-L. Kuo. Diff-dagger: Uncertainty estimation with diffusion policy for robotic manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4845–4852. IEEE, 2025

[10] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022

[11] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, T. Wang, Y.-Q. Zhang, J. Liu, and X. Zhan. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=kt51kZH4aG

[12] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025

[13] R. Römer, A. Kobras, L. Worbis, and A. P. Schoellig. Failure prediction at runtime for generative robot policies. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[14] S. Zhou, B. Zhu, J. Yang, X. Zhao, J. Chen, and Y.-G. Jiang. Rc-nf: Robot-conditioned normalizing flow for real-time anomaly detection in robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 43050–43060, June 2026

[15] C. Ma, G. Yang, K. Lu, S. Xu, B. Byrne, N. Trigoni, and A. Markham. Cyclevla: Proactive self-correcting vision-language-action models via subtask backtracking and minimum bayes risk decoding. arXiv preprint arXiv:2601.02295, 2026

[16] Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen. Failsafe: Reasoning and recovery from failures in vision-language-action models, 2025. URL https://arxiv.org/abs/2510.01642

[17] C. Agia, R. Sinha, J. Yang, Z. Cao, R. Antonova, M. Pavone, and J. Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 689–723. PMLR, 2025

[18] C. Xu, T. K. Nguyen, E. Dixon, C. Rodriguez, P. Miller, R. Lee, P. Shah, R. A. Ambrus, H. Nishimura, and M. Itkina. Can we detect failures without failure data? uncertainty-aware runtime failure detection for imitation learning policies. In Robot Evaluation for the Real World, 2025. URL https://openreview.net/forum?id=A2iUXYdWZD

[19] Q. Rolland, F. Mayran de Chamisso, and J.-B. Mouret. Failure identification in imitation learning via statistical and semantic filtering. In IEEE International Conference on Robotics and Automation (ICRA), 2026.

[20] G. Zheng, S. Seenivasan, M. Johnson-Roberson, and W. Zhi. Rewind-il: Online failure detection and state respawning for imitation learning, 2026. URL https://arxiv.org/abs/2604.16683

[21] H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025

[22] A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Robotics: Science and Systems 2026, 2026

[23] T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. Roboreward: A dataset and benchmark for vision-language reward models in robotics, 2025. URL https://openreview.net/forum?id=iDmt7Tzmke

[24] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

[25] K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/abdeb6f575ac5c6676b747bca8d09cc2-Paper.pdf

[27] J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan. A simple fix to mahalanobis distance for improving near-ood detection, 2021. URL https://arxiv.org/abs/2106.09022

[28] M. Müller and M. Hein. Mahalanobis++: Improving OOD detection via feature normalization. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=vutMcZl50l

[29] E. Daxberger, A. Kristiadi, A. Immer, R. Eschenhagen, M. Bauer, and P. Hennig. Laplace redux - effortless bayesian deep learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20089–20103. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/...

[30] M. Reuss, H. Zhou, M. Rühle, Ö. E. Yağmurlu, F. Otto, and R. Lioutikov. FLOWER: Democratizing generalist robot policies with efficient vision-language-flow models. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=JeppaebLRD

[31] Q. Li, Y. Deng, Y. Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, Y. Zhang, X. Chen, H. Chen, L. Sun, D. Chen, J. Yang, and B. Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. arXiv preprint arXiv:2510.21571, 2025

[32] C. Xu, T. K. Nguyen, E. Dixon, C. Rodriguez, P. Miller, R. Lee, P. Shah, R. Ambrus, H. Nishimura, and M. Itkina. Can we detect failures without failure data? uncertainty-aware runtime failure detection for imitation learning policies. In Robotics: Science and Systems, 2025.

[33] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerv... 2024

[34] B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024

[35] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. volume 36, pages 44776–44791, 2023

[36] Jülich Supercomputing Centre. JUWELS Cluster and Booster: Exascale Pathfinder with Modular Supercomputing Architecture at Juelich Supercomputing Centre. Journal of large-scale research facilities, 7(A183), 2021. doi:10.17815/jlsrf-7-183. URL http://dx.doi.org/10.17815/jlsrf-7-183.

A Penalized Detection Time

Here, we provide further details on the AUCP...