ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

Bingjia Huang; Hao Wu; Kun Li; Liang Mi; Ting Cao; Weijun Wang; Xiang Wang; Xiangyu Li; Yunxin Liu; Zixu Hao

arxiv: 2606.08508 · v1 · pith:PT2C45XLnew · submitted 2026-06-07 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-27 18:33 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords failure detection · generative robot policies · action space · early warning · temporal consistency · robot learning · online monitoring

The pith

Action chunks from generative robot policies already carry strong predictive signals for impending failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that generative robot policies produce action sequences whose internal patterns can reveal failures in advance, without any need to inspect the policy's internals or run extra samples. It demonstrates this by defining two compact action-derived signals and feeding them through a lightweight predictor to output failure probabilities step by step. A sympathetic reader would care because current detectors either demand white-box access or add runtime cost, while this method stays black-box and cheap yet moves the accuracy-timeliness frontier forward. If the claim holds, early detection becomes portable across policies and transferable to real robots with measurable gains in warning time and downstream learning efficiency.

Core claim

Emitted action chunks from generative robot policies already encode predictive information about failures; two signals extracted from a single forward pass—Temporal Consistency Error between consecutive chunks and Action Chunk Magnitude of the current chunk—when passed through a task-conditioned LSTM-MLP, yield per-step failure probabilities that improve the F1-timeliness Pareto frontier by +12.7% hypervolume and early-detection ROC-AUC by +9.0% on unseen tasks, while transferring to real-robot pick tasks and accelerating PPO fine-tuning with 2.9x fewer interactions.

What carries the argument

ActProbe, a detector that extracts Temporal Consistency Error (TCE) and Action Chunk Magnitude (ACM) from action chunks and maps them via a task-conditioned LSTM-MLP to failure probabilities.
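
As a rough illustration of the two signals, the sketch below computes a TCE-like and an ACM-like quantity from plain Python lists. The overlap convention (the policy re-plans every `stride` steps), the choice of mean L2 distance, and the function names are all assumptions made for this example; the text above does not give the paper's exact definitions.

```python
import math

def l2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def temporal_consistency_error(prev_chunk, curr_chunk, stride):
    """Mean L2 gap between the overlapping parts of two consecutive chunks.

    Assumption: the policy re-plans every `stride` steps, so
    prev_chunk[stride:] and curr_chunk[:len(prev_chunk) - stride]
    predict actions for the same timesteps.
    """
    overlap = len(prev_chunk) - stride
    if overlap <= 0:
        raise ValueError("chunks do not overlap for this stride")
    gaps = [
        l2([a - b for a, b in zip(prev_chunk[stride + k], curr_chunk[k])])
        for k in range(overlap)
    ]
    return sum(gaps) / overlap

def action_chunk_magnitude(chunk):
    """Mean L2 norm of the actions in one chunk (assumed definition)."""
    return sum(l2(a) for a in chunk) / len(chunk)

# Toy 2-D actions, horizon 4, re-planning every 2 steps (hypothetical values).
prev_chunk = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
curr_chunk = [[2.0, 0.0], [3.0, 4.0], [4.0, 0.0], [5.0, 0.0]]
tce = temporal_consistency_error(prev_chunk, curr_chunk, stride=2)
acm = action_chunk_magnitude(curr_chunk)
print(tce, acm)
```

Both quantities need only the chunks the policy already emits, which is what keeps a detector of this kind black-box.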

If this is right

Alerts can be issued before failures become visually recognizable.
The accuracy-timeliness trade-off improves by +12.7% hypervolume on average over internal- and external-feature baselines.
A +9.0% lead in early-detection ROC-AUC holds on tasks never seen during training.
The detector transfers directly to real-robot deployment and reduces environment interactions needed for PPO fine-tuning by a factor of 2.9.
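
For readers unfamiliar with the hypervolume indicator, the sketch below computes the area dominated by a set of (F1, timeliness) operating points, with both objectives maximised and a reference point at the origin. The point values, the reference point, and the function names are illustrative assumptions, not figures or code from the paper.

```python
def pareto_front(points):
    """Keep the points not dominated by any other (both objectives maximised)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))

def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the front and bounded below by `ref`.

    Assumes every point lies at or above `ref` in both coordinates.
    """
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        # Sweeping from the largest x, each point adds a strip above prev_y.
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area

# Hypothetical (F1, timeliness) operating points from a threshold sweep.
points = [(0.9, 0.2), (0.7, 0.5), (0.4, 0.8), (0.5, 0.4)]
hv = hypervolume_2d(points)
print(pareto_front(points), hv)
```

A detector whose frontier encloses more area offers a better trade-off at some operating points without being worse at the others.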

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Action-only probing could be combined with minimal observation features to handle edge cases where action signals weaken.
The same chunk-consistency idea might extend to other autoregressive sequence generators outside robotics.
Because no resampling is required, the method could be inserted as a lightweight safety layer in existing deployed policies.

Load-bearing premise

The two action-derived signals together with the task-conditioned LSTM-MLP suffice to generalize failure prediction across policies, benchmarks, and unseen real-robot tasks without internal access or resampling.

What would settle it

An experiment on a new generative policy or real-robot task in which ActProbe raises no earlier alerts than visual inspection or than baselines that use policy internals or observation features.

Original abstract

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.
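
To make "raising alerts before failure" concrete, the sketch below turns a sequence of per-step failure probabilities into an alert step and a lead time. The fixed threshold, the notion of a known failure step, and the function names are assumptions for illustration only; the abstract does not specify the paper's alerting rule.

```python
def first_alert_step(probs, threshold=0.5):
    """Index of the first step whose failure probability reaches the threshold."""
    for t, p in enumerate(probs):
        if p >= threshold:
            return t
    return None

def lead_time(probs, failure_step, threshold=0.5):
    """Steps of warning before the failure; None if no alert fires in time."""
    t = first_alert_step(probs, threshold)
    if t is None or t > failure_step:
        return None
    return failure_step - t

# Hypothetical per-step probabilities for one episode that fails at step 3.
probs = [0.1, 0.2, 0.6, 0.9]
print(first_alert_step(probs), lead_time(probs, failure_step=3))
```

Sweeping the threshold trades accuracy against timeliness: a lower threshold alerts earlier but fires on more successful episodes.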

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Action chunks carry usable failure signals and ActProbe gets measurable gains, but task conditioning for unseen tasks is under-specified enough to weaken the generalization claim.

The core observation holds up: the actions a generative policy emits already contain predictive information about upcoming failures, and you can extract it with two lightweight signals—TCE across consecutive chunks and ACM of the current chunk—without touching internals or resampling. ActProbe then runs those through a task-conditioned LSTM-MLP to output per-step failure probabilities. That setup is new relative to the internal-feature and observation-side detectors the abstract contrasts against.

The paper does the empirical work: it reports an average +12.7% hypervolume improvement on the F1-timeliness frontier across a suite of policies and benchmarks, plus a +9.0% ROC-AUC lead on unseen tasks, and shows transfer to real-robot pick tasks plus PPO fine-tuning with 2.9x fewer environment interactions. Those numbers address a practical deployment issue.

The soft spot is the task-conditioning piece. The architecture is described as task-conditioned, yet the abstract leaves open how the conditioning (embeddings, labels, or otherwise) is produced or adapted for tasks absent from the probe training data. If that step requires per-task information unavailable at deployment, the claimed transfer to unseen real-robot tasks and the “without internal access” result rest on an assumption that is not yet shown to hold. The concern is therefore a real one.
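
One way conditioning could work without per-task labels is to derive the conditioning vector from the task's language instruction with a frozen encoder and append it to every per-step signal pair. The sketch below shows that wiring only. The hashed bag-of-words `embed_task` is a toy stand-in for a real frozen text encoder, and every name and dimension here is hypothetical; the abstract does not confirm that the paper does this.

```python
import hashlib

def embed_task(description, dim=8):
    """Toy frozen text encoder: hashed bag-of-words, L2-normalised.

    Stand-in for a real frozen sentence encoder; purely illustrative.
    """
    vec = [0.0] * dim
    for word in description.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else vec

def conditioned_inputs(signal_seq, description, dim=8):
    """Append the same task vector to every per-step (TCE, ACM) pair."""
    task_vec = embed_task(description, dim)
    return [list(step) + task_vec for step in signal_seq]

# Hypothetical per-step (TCE, ACM) values for one episode.
steps = [(0.1, 1.2), (0.4, 1.1), (0.9, 0.3)]
inputs = conditioned_inputs(steps, "pick up the red mug")
print(len(inputs), len(inputs[0]))
```

Because the encoder is fixed and consumes only the instruction text, an unseen task needs no new labels or parameters under this scheme, which is the property the generalization claim depends on.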

The rest of the argument is straightforward—no circular derivations, just empirical signals and a simple model. Experiments appear to support the main claims, though full details on variance, dataset splits, and conditioning implementation would be needed to judge robustness.

This is for robotics researchers working on safe deployment of generative policies. It is worth a serious referee because the problem is real, the method is lightweight, and the reported gains are concrete even if the generalization story needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that action chunks emitted by generative robot policies contain strong predictive signals for impending failures. It introduces ActProbe, a lightweight detector that feeds two action-derived signals, Temporal Consistency Error (TCE) and Action Chunk Magnitude (ACM), into a task-conditioned LSTM-MLP to output per-step failure probabilities. The method is evaluated across generative policies and benchmarks, reporting a +12.7% average hypervolume gain on the accuracy-timeliness Pareto frontier and a +9.0% ROC-AUC improvement on unseen tasks, plus transfer to real-robot pick tasks and RL fine-tuning with 2.9x fewer environment interactions, all without policy internals or resampling.
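
As background for the ROC-AUC figure, the sketch below computes AUC as the probability that a randomly chosen failure episode receives a higher score than a randomly chosen success episode. Using one score per episode (for example, the highest failure probability within an early window) is an assumption of this example, as are the sample values; the paper's exact early-detection protocol is not described above.

```python
def roc_auc(scores, labels):
    """Probability that a random failure episode outscores a random success.

    Ties count as half, which matches the Mann-Whitney U formulation of AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one success")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical episode-level scores; label 1 marks a failed episode.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

The pairwise form is quadratic in the number of episodes, which is fine for small evaluation sets; a sort-based version would be preferable at scale.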

Significance. If the empirical results hold under rigorous validation, ActProbe would provide a practical, low-overhead failure detector for generative policies that operates purely in action space. This could improve safety monitoring and accelerate training loops in robotics without requiring white-box access, addressing a key deployment challenge. The core observation that action chunks carry failure signal is a useful empirical contribution if the generalization claims are substantiated.

major comments (2)

[Architecture description] Architecture description (likely §3): the task-conditioning mechanism for the LSTM-MLP is under-specified for unseen tasks. The generalization claim of +9.0% ROC-AUC lead and successful transfer to unseen real-robot pick tasks depends on how task embeddings or conditioning vectors are obtained or adapted at deployment; if this requires per-task labels or parameters unavailable without internal access, the zero-shot transfer result rests on an untested assumption.
[Experimental results] Experimental results section (likely §4 or §5): the reported hypervolume gains and ROC-AUC improvements lack details on error bars, dataset splits, number of runs, or statistical significance testing. Without these, it is difficult to assess whether the +12.7% and +9.0% margins are robust or sensitive to benchmark selection.

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit forward references to the specific tables or figures that support the hypervolume and ROC-AUC claims.
[Method] Notation for TCE and ACM should be defined with equations in the main text rather than relying solely on prose descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, indicating where revisions will strengthen the paper.

Referee: [Architecture description] Architecture description (likely §3): the task-conditioning mechanism for the LSTM-MLP is under-specified for unseen tasks. The generalization claim of +9.0% ROC-AUC lead and successful transfer to unseen real-robot pick tasks depends on how task embeddings or conditioning vectors are obtained or adapted at deployment; if this requires per-task labels or parameters unavailable without internal access, the zero-shot transfer result rests on an untested assumption.

Authors: We agree that the task-conditioning mechanism in Section 3 requires additional specification to fully support the generalization claims. In the revised manuscript we will expand the architecture description to detail exactly how task embeddings are computed from task descriptions (via a fixed encoder) and injected into the LSTM-MLP, and we will explicitly state the procedure used for unseen tasks. This clarification will confirm that no per-task labels or policy-internal parameters are required at deployment.

Revision: yes

Referee: [Experimental results] Experimental results section (likely §4 or §5): the reported hypervolume gains and ROC-AUC improvements lack details on error bars, dataset splits, number of runs, or statistical significance testing. Without these, it is difficult to assess whether the +12.7% and +9.0% margins are robust or sensitive to benchmark selection.

Authors: We acknowledge that the experimental results section would benefit from more rigorous statistical reporting. In the revised manuscript we will add error bars (standard deviation across seeds), explicitly describe the train/validation/test splits, report the number of independent runs, and include statistical significance tests (e.g., paired t-tests) for the reported hypervolume and ROC-AUC improvements.

Revision: yes
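
The kind of paired comparison mentioned in the response can be sketched as follows, using only the Python standard library. The per-seed scores are invented for illustration, and the function name is hypothetical. The sketch returns the t statistic and degrees of freedom; a p-value would still need a t-distribution table or a statistics package.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("differences have zero variance")
    return mean_d / (sd_d / math.sqrt(len(diffs))), len(diffs) - 1

# Made-up ROC-AUC values for the same five seeds under two detectors.
probe_scores = [0.82, 0.85, 0.80, 0.84, 0.83]
baseline_scores = [0.78, 0.80, 0.79, 0.79, 0.80]
t_stat, dof = paired_t_statistic(probe_scores, baseline_scores)
print(t_stat, dof)
```

Pairing by seed removes seed-to-seed variance from the comparison, which matters when that variance is large relative to the gap between methods.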

Circularity Check

0 steps flagged

No circularity; purely empirical detector with no derivations or self-referential fits

The paper defines two action-derived signals (TCE, ACM) from emitted chunks, then trains a standard task-conditioned LSTM-MLP to map them to failure probabilities. This is conventional supervised learning on observable inputs; the reported gains are measured against external baselines on held-out tasks and real-robot data. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim reduces to empirical validation rather than any input-equivalent construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities identifiable. The LSTM-MLP architecture and task-conditioning are treated as standard components.

A. Architecture and training hyperparameters

Training loss. The probe is trained with the per-step binary cross-entropy, averaged over valid (non-padded) timesteps:

$$\mathcal{L} = -\frac{1}{\sum_i T_i} \sum_i \sum_{t=1}^{T_i} y_i \log s \ldots$$
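
The per-step binary cross-entropy averaged over valid timesteps can be sketched in plain Python as below. The source formula is cut off after its first term, so the `(1 - y) * log(1 - p)` term is filled in here as the standard binary cross-entropy form and should be read as an assumption, as should the clamping constant, the argument layout, and the function name.

```python
import math

def masked_bce(probs, labels, lengths, eps=1e-7):
    """Per-step binary cross-entropy averaged over non-padded timesteps.

    probs[i][t] : predicted failure probability for episode i at step t
    labels[i]   : 0/1 episode-level label, applied to every valid step
    lengths[i]  : number of valid steps T_i; later entries are padding
    """
    total, count = 0.0, 0
    for p_seq, y, T in zip(probs, labels, lengths):
        for t in range(T):
            p = min(max(p_seq[t], eps), 1.0 - eps)  # clamp for log stability
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count

# Two padded episodes: the first has 2 valid steps, the second has 1.
probs = [[0.5, 0.5, 0.0], [0.5, 0.9, 0.9]]
labels = [1, 0]
lengths = [2, 1]
loss = masked_bce(probs, labels, lengths)
print(loss)
```

Dividing by the total number of valid steps rather than by the number of episodes means that long episodes contribute more terms to the average, which matches the normalisation by the sum of episode lengths in the formula.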

Pith/arXiv arXiv 2017

[1] [1]

RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

Anthony Brohan et al. RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022

[2] [2]

RT-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023. arXiv:2307.15818

Pith/arXiv arXiv 2023

[3] [3]

OpenVLA: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[4] [4]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[5] [5]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.𝜋0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[6] [6]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[7] [7]

CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Pith/arXiv arXiv 2024

[8] [8]

World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[9] [9]

Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopad- hyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025

[10] [10]

Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025. 13

Pith/arXiv arXiv 2025

[11] [11]

Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025

[12] [12]

LIBERO: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[13] [13]

RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

Pith/arXiv arXiv 2024

[14] [14]

DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

Alexander Khazatsky, Karl Pertsch, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024

[15] [15]

SAFE: Multitask failure detection for vision-language-action models

Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, and Florian Shkurti. SAFE: Multitask failure detection for vision-language-action models. In Advances in Neural Information Processing Systems (NeurIPS), 2025

2025

[16] [16]

Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress

Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, and Jeannette Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. In Conference on Robot Learning (CoRL), 2024

2024

[17] [17]

Task-driven out-of-distribution detection with statistical guarantees for robot learning

Alec Farid, Sushant Veer, and Anirudha Majumdar. Task-driven out-of-distribution detection with statistical guarantees for robot learning. In Proceedings of the 5th Conference on Robot Learning (CoRL), pages 970–980, 2021. arXiv:2106.13703

arXiv 2021

[18] [18]

Multi-task interactive robot fleet learning with visual world models

Huihan Liu, Yu Zhang, Vaarij Betala, Evan Zhang, James Liu, Crystal Ding, and Yuke Zhu. Multi-task interactive robot fleet learning with visual world models. In Proceedings of the 8th Conference on Robot Learning (CoRL), 2024. arXiv:2410.22689

arXiv 2024

[19] [19]

Real-time anomaly detection and reactive planning with large language models

Rohan Sinha, Amine Elhafsi, Christopher Agia, Matthew Foutter, Edward Schmerling, and Marco Pavone. Real-time anomaly detection and reactive planning with large language models. In Proceedings of Robotics: Science and Systems (RSS), 2024. arXiv:2407.08735

arXiv 2024

[20] [20]

Asking for help: Failure prediction in behavioral cloning through value approximation

Cem Gokmen, Daniel Ho, and Mohi Khansari. Asking for help: Failure prediction in behavioral cloning through value approximation. In IEEE International Conference on Robotics and Automation (ICRA), pages 5821–5828, 2023

2023

[21] [21]

Vision-language models as success detectors

Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica Landon, Felix Hill, Nando de Freitas, and Serkan Cabi. Vision-language models as success detectors. In Conference on Lifelong Learning Agents (CoLLAs), pages 120–136. PMLR, 2023

2023

[22] [22]

AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation

Jiafei Duan et al. AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.00371

arXiv 2025

[23] [23]

Can we detect failures without failure data? Uncertainty-aware runtime failure detection for imitation learning policies. arXiv preprint arXiv:2503.08558, 2025

Chen Xu, Tony Khuong Nguyen, Emma Dixon, Christopher Rodriguez, Patrick Miller, Robert Lee, Paarth Shah, Rares Ambrus, Haruki Nishimura, and Masha Itkina. Can we detect failures without failure data? Uncertainty-aware runtime failure detection for imitation learning policies. arXiv preprint arXiv:2503.08558, 2025

arXiv 2025

[24] [24]

FIPER: Failure prediction at runtime for generative robot policies

Ralf Römer, Adrian Kobras, Luca Worbis, and Angela P. Schoellig. FIPER: Failure prediction at runtime for generative robot policies. In Advances in Neural Information Processing Systems (NeurIPS), 2025

2025

[25] [25]

Temporal difference calibration in sequential tasks: Application to vision-language-action models. arXiv preprint arXiv:2604.20472, 2026

Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano, and Aviv Tamar. Temporal difference calibration in sequential tasks: Application to vision-language-action models. arXiv preprint arXiv:2604.20472, 2026

Pith/arXiv arXiv 2026

[26] [26]

Diff-DAgger: Uncertainty estimation with diffusion policy for robotic manipulation

Sung-Wook Lee, Xuhui Kang, and Yen-Ling Kuo. Diff-DAgger: Uncertainty estimation with diffusion policy for robotic manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2025

2025

[27] [27]

The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5(7):1688–1703, 1985

Tamar Flash and Neville Hogan. The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5(7):1688–1703, 1985

1985

[28] [28]

On the analysis of movement smoothness. Journal of NeuroEngineering and Rehabilitation, 12:112, 2015

Sivakumar Balasubramanian, Alejandro Melendez-Calderon, Agnes Roby-Brami, and Etienne Burdet. On the analysis of movement smoothness. Journal of NeuroEngineering and Rehabilitation, 12:112, 2015

2015

[29] [29]

Performance assessment of multiobjective optimizers: An analysis and review

Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M. Fonseca, and Viviane Grunert da Fonseca. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117–132, 2003

2003

[30] [30]

RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965, 2025

Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, et al. RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965, 2025

arXiv 2025

[31] [31]

UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025

[32] [32]

FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025

[33] [33]

Learning universal policies via text-guided video generation

Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[34] [34]

Learning interactive real-world simulators

Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations (ICLR), 2024. arXiv:2310.06114

Pith/arXiv arXiv 2024

[35] [35]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023. arXiv:2303.04137

Pith/arXiv arXiv 2023

[36] [36]

Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[37] [37]

The internal state of an LLM knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, 2023. arXiv:2304.13734

Pith/arXiv arXiv 2023

[38] [38]

Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

Pith/arXiv arXiv 2022

[39] [39]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR), 2023. arXiv:2302.09664

Pith/arXiv arXiv 2023

[40] [40]

A baseline for detecting misclassified and out-of-distribution examples in neural networks

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR), 2017. arXiv:1610.02136

Pith/arXiv arXiv 2017

[41] [41]

Simple and scalable predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017. arXiv:1612.01474

Pith/arXiv arXiv 2017

[42] [42]

Dropout as a Bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016. arXiv:1506.02142

Pith/arXiv arXiv 2016

[43] [43]

Uncertainty comes for free: Human-in-the-loop policies with diffusion models. arXiv preprint arXiv:2503.01876, 2025

Zhanpeng He, Yifeng Cao, and Matei Ciocarlie. Uncertainty comes for free: Human-in-the-loop policies with diffusion models. arXiv preprint arXiv:2503.01876, 2025

arXiv 2025

[44] [44]

Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

Pith/arXiv arXiv 2025

[45] [45]

Algorithmic Learning in a Random World

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. doi: 10.1007/b106715

work page doi:10.1007/b106715 2005

[46] [46]

Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

A. Architecture and training hyperparameters

Training loss. The probe is trained with the per-step binary cross-entropy, averaged over valid (non-padded) timesteps:

$$\mathcal{L} = -\frac{1}{\sum_i T_i} \sum_i \sum_{t=1}^{T_i} y_i \log s \ldots$$
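The training-loss description above can be illustrated with a minimal sketch of a per-step binary cross-entropy averaged over valid (non-padded) timesteps. The source formula is truncated, so this is not the authors' implementation. It assumes the standard two-term binary cross-entropy and a single episode-level label applied to every step of that episode. The names `masked_bce`, `scores`, `labels`, `lengths`, and `eps` are hypothetical.

```python
import math

def masked_bce(scores, labels, lengths, eps=1e-7):
    """Per-step binary cross-entropy averaged over valid timesteps.

    scores  : list of per-episode lists of predicted failure probabilities,
              possibly padded to a common length (assumed layout)
    labels  : one 0/1 label per episode, broadcast over its steps (assumption)
    lengths : number of valid (non-padded) steps T_i per episode
    """
    total, count = 0.0, 0
    for s_i, y_i, t_i in zip(scores, labels, lengths):
        # Only the first T_i entries are real; the rest is padding and is skipped.
        for s in s_i[:t_i]:
            s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y_i * math.log(s) + (1 - y_i) * math.log(1.0 - s)
            count += 1
    # Normalize by the total number of valid steps, i.e. the sum of T_i.
    return total / count

# The padded 0.9 in the first episode is ignored because its length is 2.
loss = masked_bce([[0.5, 0.5, 0.9], [0.5, 0.5, 0.5]], [1, 0], [2, 3])
```

Dividing by the total count of valid steps rather than by the number of episodes matches the stated normalization by the sum of episode lengths, so longer episodes contribute proportionally more terms.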

Pith/arXiv arXiv 2017