BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

Anya Singh; Cabrel Happi; Jai Relan; Varun Nair; Vidyut Baradwaj

arxiv: 2605.30660 · v1 · pith:HVR2LWARnew · submitted 2026-05-28 · 💻 cs.LG · cs.RO

Pith reviewed 2026-06-29 08:05 UTC · model grok-4.3

keywords conformal abstention · VLA policies · test-time scaling · violation rate guarantees · Mondrian conformal prediction · vision language action · calibrated abstention · distribution free guarantees

The pith

BOKBO adds a conformal abstention layer to K-sample VLA inference that gives finite-sample distribution-free guarantees on executed violation rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BOKBO to address a gap in current test-time scaling for vision-language-action policies: these methods sample K candidate actions and execute the best one, yet still execute a violating action when every sample is unsafe. BOKBO inserts a calibrated abstention step that decides whether to skip execution, backed by statistical bounds that limit the long-run rate of executed violations. Both a single global threshold and a per-task Mondrian version are provided, the latter improving reliability on the hardest tasks. The work also shows that common internal policy scores fail to track actual safety violations under certain sampling schemes, while a learned predictor on visual features and task identity can be calibrated tightly.
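The control flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `select_or_abstain`, `verifier_score`, `predicted_risk`, and `threshold` are hypothetical names, and the sketch assumes the violation predictor returns one scalar risk per inference round, with higher values meaning a violation is more likely.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Action = Sequence[float]


@dataclass
class Decision:
    action: Optional[Action]  # None means the layer abstained
    predicted_risk: float


def select_or_abstain(
    candidates: Sequence[Action],
    verifier_score: Callable[[Action], float],
    predicted_risk: float,
    threshold: float,
) -> Decision:
    """Return the verifier-best of K candidates, or abstain if risk is too high.

    All names here are illustrative assumptions; the threshold is expected to
    come from a separate calibration step.
    """
    if predicted_risk > threshold:
        # Abstain: none of the K candidates is executed this round.
        return Decision(action=None, predicted_risk=predicted_risk)
    best = max(candidates, key=verifier_score)
    return Decision(action=best, predicted_risk=predicted_risk)
```

Without the abstention branch, the function reduces to plain best-of-K selection, which always executes something even when every candidate is unsafe.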

Core claim

BOKBO is the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. It supplies both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at epsilon = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86 percent of bootstrap splits with 78 percent coverage and 70 percent net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.

What carries the argument

Conditional conformal risk control (CRC) applied to a learned violation predictor that takes semantic visual features and task identity as inputs, producing an abstention decision for each K-sample inference round.
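A hedged sketch of how a CRC-style abstention threshold could be calibrated follows. It assumes each calibration round has a predicted risk score and a binary violation label, and that a round is executed when its score is at or below the threshold. The function name `crc_threshold` and the exact selection rule are illustrative assumptions, not the paper's procedure. The rule keeps the largest calibration score whose inflated empirical risk, (violations executed + 1) / (n + 1), stays at or below epsilon.

```python
import numpy as np


def crc_threshold(scores, violations, epsilon):
    """Largest calibration score whose inflated executed-violation risk <= epsilon.

    scores:     predicted risk per calibration round (higher = riskier)
    violations: 1 if executing that round caused a violation, else 0
    Returns -inf when no threshold satisfies the bound (always abstain).
    """
    s = np.asarray(scores, dtype=float)
    v = np.asarray(violations, dtype=float)
    order = np.argsort(s, kind="stable")
    s, v = s[order], v[order]
    n = s.size
    # For threshold s[i], every round with score <= s[i] is executed,
    # so tied scores must be counted together.
    last_tie = np.searchsorted(s, s, side="right") - 1
    bound = (np.cumsum(v)[last_tie] + 1.0) / (n + 1.0)
    ok = np.flatnonzero(bound <= epsilon)
    return s[ok[-1]] if ok.size else -np.inf
```

At deployment a round would be executed only when its score is at or below the returned value. A return of `-inf` means abstaining on every round, which happens whenever the calibration set is too small or too risky for the target epsilon.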

If this is right

Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.
Results remain stable across five training seeds, replicate within bootstrap noise on pi0-FAST, hold on libero_spatial_temp_x0.1, and survive four within-suite distribution shifts.
Policy-internal nonconformity scores correlate at 0.98 with the action-noise hyperparameter rather than with actual safety violations under perturbation-based K-sampling.
Globally set force thresholds below expert-typical manipulation forces inflate reported violation rates by a factor of five.
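The per-task (Mondrian) variant in the first point can be illustrated by calibrating one threshold per task instead of a single global one. The sketch below is an assumption about how such a partition could be implemented, with hypothetical names `crc_threshold` and `mondrian_thresholds`; it reuses a CRC-style rule in which a round is executed when its predicted risk is at or below the task's threshold.

```python
import numpy as np


def crc_threshold(scores, violations, epsilon):
    """Largest score whose inflated executed-violation risk stays <= epsilon."""
    s = np.asarray(scores, dtype=float)
    v = np.asarray(violations, dtype=float)
    order = np.argsort(s, kind="stable")
    s, v = s[order], v[order]
    n = s.size
    last_tie = np.searchsorted(s, s, side="right") - 1  # group tied scores
    bound = (np.cumsum(v)[last_tie] + 1.0) / (n + 1.0)
    ok = np.flatnonzero(bound <= epsilon)
    return s[ok[-1]] if ok.size else -np.inf


def mondrian_thresholds(scores, violations, task_ids, epsilon):
    """Calibrate a separate threshold on each task's own calibration rounds."""
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    task_ids = np.asarray(task_ids)
    thresholds = {}
    for task in np.unique(task_ids).tolist():
        mask = task_ids == task
        thresholds[task] = crc_threshold(scores[mask], violations[mask], epsilon)
    return thresholds
```

One consequence of this rule: a task with fewer than 19 calibration rounds cannot obtain a finite threshold at epsilon = 0.05 even with zero observed violations, because 1 / (n + 1) already exceeds 0.05. Per-task calibration therefore trades coverage on small tasks for tighter per-task control.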

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same abstention construction could be tested on other multi-sample inference pipelines in robotics to add explicit safety bounds without retraining the base policy.
The documented failure of internal scores under perturbation sampling points to a general need for external predictors whenever K-sampling is driven by added noise rather than model stochasticity.
Re-evaluation of earlier VLA safety numbers that used low force thresholds may be required before comparing violation rates across methods.
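One way to check whether a candidate nonconformity score suffers the failure described in the second point is to compare its correlation with the sampling-noise level against its correlation with observed violations. The sketch below is a generic diagnostic using Pearson correlation; the function name `score_diagnostics` and its argument names are hypothetical, and the paper's own analysis may use a different statistic.

```python
import numpy as np


def score_diagnostics(score, sigma, violation):
    """Pearson correlation of a score with the noise level and with violations.

    score:     candidate nonconformity score per round
    sigma:     action-noise hyperparameter used for that round
    violation: 1 if the executed action violated a safety constraint, else 0
    """
    score = np.asarray(score, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    violation = np.asarray(violation, dtype=float)
    return {
        "corr_with_sigma": float(np.corrcoef(score, sigma)[0, 1]),
        "corr_with_violation": float(np.corrcoef(score, violation)[0, 1]),
    }
```

A score that is useful for abstention should show the opposite pattern to the one reported: a meaningful correlation with violations that does not simply mirror the noise setting. Note that `np.corrcoef` returns NaN when either input is constant, so the diagnostic needs rounds collected at more than one noise level.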

Load-bearing premise

A learned violation predictor conditioned on semantic visual features and task identity can be calibrated tightly enough for the conditional CRC bound to hold on the majority of bootstrap splits.

What would settle it

On a new held-out distribution shift or task suite, the fraction of bootstrap splits where the conditional CRC bound holds at epsilon = 0.05 falls well below 86 percent while coverage remains near 78 percent.
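The hold-fraction statistic in this test can be estimated as sketched below. Several details are assumptions rather than the paper's protocol: splits are random calibration/test partitions, a split "holds" when the test-half rate of rounds that are both executed and violating is at or below epsilon, and the names `crc_threshold` and `hold_fraction` are hypothetical. A reading that conditions on execution would instead divide by the number of executed rounds.

```python
import numpy as np


def crc_threshold(scores, violations, epsilon):
    """Largest score whose inflated executed-violation risk stays <= epsilon."""
    s = np.asarray(scores, dtype=float)
    v = np.asarray(violations, dtype=float)
    order = np.argsort(s, kind="stable")
    s, v = s[order], v[order]
    n = s.size
    last_tie = np.searchsorted(s, s, side="right") - 1  # group tied scores
    bound = (np.cumsum(v)[last_tie] + 1.0) / (n + 1.0)
    ok = np.flatnonzero(bound <= epsilon)
    return s[ok[-1]] if ok.size else -np.inf


def hold_fraction(scores, violations, epsilon, n_splits=200, calib_frac=0.5, seed=0):
    """Fraction of random splits whose test executed-violation rate <= epsilon."""
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    rng = np.random.default_rng(seed)
    n = scores.size
    n_cal = int(calib_frac * n)
    holds = 0
    for _ in range(n_splits):
        perm = rng.permutation(n)
        cal, test = perm[:n_cal], perm[n_cal:]
        lam = crc_threshold(scores[cal], violations[cal], epsilon)
        executed = scores[test] <= lam
        # Rate over all test rounds of "executed and violating".
        rate = float(np.mean(executed * violations[test]))
        holds += int(rate <= epsilon)
    return holds / n_splits
```

Because the underlying guarantee is on the expected loss, individual splits can exceed epsilon without contradicting it, so a hold fraction below 1.0 is not in itself a failure of the theory.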

Original abstract

Test-time scaling for vision-language-action (VLA) policies, methods such as RoboMonkey, SEAL, MG-Select, and V-GPS, samples K candidate action chunks at inference and executes the verifier-best. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $\sigma$, while correlating at the noise floor with actual safety violations. We test the failure's scope by replicating the analysis under token-level temperature sampling and find the failure is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $\epsilon$ = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $\pi_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally-set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BOKBO adds conformal abstention to K-sample VLA policies and shows internal scores track noise not safety, but the guarantees rest on empirical calibration of a learned predictor.

The main point is that this paper supplies a conformal layer so VLA policies can abstain when every one of the K sampled actions is unsafe, plus an analysis that policy confidence and disagreement mostly follow the sampling noise level rather than actual violations.

What is new is the conformal risk control wrapper for the K-sample inference setting, including a Mondrian per-task variant that narrows the gap on hard tasks. The structural failure result is concrete: the base scores correlate 0.98 with the noise hyperparameter sigma but sit at the noise floor with safety violations. They also flag and correct the force-threshold pitfall that was inflating reported violation rates by 5x.

The experiments report 78% coverage and 70% success at epsilon 0.05 on libero_object, stable across five seeds, with the Mondrian version raising the minimum per-task hold fraction from 0.71 to 0.93. Results hold on a second benchmark and under some distribution shifts.

The soft spot is that the finite-sample distribution-free claim depends on the learned violation predictor (conditioned on visual features and task identity) delivering tight enough calibration. The conditional bound holds on only 86% of bootstrap splits, so the guarantee is partial in practice. If exchangeability between calibration and test points breaks because of the predictor, the safety transfer weakens. The paper does not appear to supply a parameter-free derivation that avoids this dependence.

This is for people building deployable VLA systems who need a practical abstention mechanism. It has enough new analysis and a working method to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper proposes BOKBO, the first conformal abstention layer for K-sample VLA inference. It provides finite-sample distribution-free guarantees (global and Mondrian/per-task variants) on executed-violation rate by using a learned violation predictor conditioned on semantic visual features and task identity. The work identifies a structural failure mode of policy-internal nonconformity scores under perturbation-based K-sampling (0.98 correlation with noise hyperparameter σ but near-zero with actual violations), replicates the analysis under token-level temperature sampling, corrects a methodological pitfall with globally-set force thresholds that inflate violation rates by 5×, and reports empirical results on libero_object_temp_x0.1 and libero_spatial_temp_x0.1 with OpenVLA-OFT and π0-FAST (conditional CRC bound holds on 86% of bootstrap splits at ε=0.05 with 78% coverage and 70% success; Mondrian raises min per-task hold fraction to 0.93).

Significance. If the central claims hold, BOKBO would be a meaningful addition to test-time scaling methods for VLAs by adding calibrated abstention with explicit finite-sample guarantees on safety violations. The identification and correction of the force-threshold pitfall and the exposure of the nonconformity-score failure mode are concrete contributions. The Mondrian variant addressing per-task gaps is a useful extension of standard conformal methods.

major comments (2)

[Abstract] The central claim of 'finite-sample distribution-free guarantees on executed-violation rate' is load-bearing, yet the reported conditional CRC bound holds on only 86% of bootstrap splits (rising to 0.93 min per-task with Mondrian). This empirical fraction indicates that the guarantee does not hold unconditionally across splits; the manuscript must either provide a theoretical argument showing why the 14% failure rate does not invalidate the distribution-free property or revise the claim to reflect that the guarantee is conditional on the learned predictor's calibration performance.
[Abstract, Methods] The finite-sample guarantee relies on the learned violation predictor (conditioned on semantic visual features and task identity) satisfying the exchangeability and calibration conditions required for both global and Mondrian CRC. The manuscript reports that this predictor 'supports tight calibration' via the 86%/78%/70% empirical figures, but does not demonstrate that the predictor's training introduces no feature-task correlations that would violate the implicit i.i.d. assumption between calibration and test points; a concrete counter-example or sensitivity analysis on this point is needed to support transfer of the distribution-free property to executed actions.

minor comments (2)

The stability across 5 training seeds and replication on π0-FAST within bootstrap noise is stated but would benefit from explicit quantification of variance in the hold-fraction metric.
The four within-suite distribution shifts are mentioned as survived but lack a table or figure reference showing per-shift coverage and success numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing theoretical clarification on the marginal nature of the guarantees and agreeing to targeted revisions for improved clarity on assumptions.

Referee: [Abstract] The central claim of 'finite-sample distribution-free guarantees on executed-violation rate' is load-bearing, yet the reported conditional CRC bound holds on only 86% of bootstrap splits (rising to 0.93 min per-task with Mondrian). This empirical fraction indicates that the guarantee does not hold unconditionally across splits; the manuscript must either provide a theoretical argument showing why the 14% failure rate does not invalidate the distribution-free property or revise the claim to reflect that the guarantee is conditional on the learned predictor's calibration performance.

Authors: The finite-sample distribution-free guarantees of CRC are marginal: they bound the expected executed-violation loss by the target level ε, with the expectation taken over the random draw of the calibration set and the test point (under exchangeability), and they do not hold conditionally for every fixed calibration set. The 86% bootstrap hold rate is an empirical diagnostic of practical performance rather than a violation of the marginal guarantee. We will revise the abstract to explicitly describe the guarantees as marginal finite-sample distribution-free guarantees, which keeps the load-bearing claim accurate. revision: yes
Referee: [Abstract, Methods] The finite-sample guarantee relies on the learned violation predictor (conditioned on semantic visual features and task identity) satisfying the exchangeability and calibration conditions required for both global and Mondrian CRC. The manuscript reports that this predictor 'supports tight calibration' via the 86%/78%/70% empirical figures, but does not demonstrate that the predictor's training introduces no feature-task correlations that would violate the implicit i.i.d. assumption between calibration and test points; a concrete counter-example or sensitivity analysis on this point is needed to support transfer of the distribution-free property to executed actions.

Authors: The violation predictor is trained on a dataset disjoint from the calibration and test sets used for CRC, ensuring that the resulting nonconformity scores satisfy exchangeability between calibration and test points. The feature and task conditioning defines the score but does not break exchangeability provided the data-generating process remains consistent. We will expand the methods section to detail this disjoint training protocol. While a dedicated sensitivity analysis on induced correlations is not included, the standard conformal assumptions suffice for the transfer of the guarantee; we view the requested counter-example as unnecessary under the maintained i.i.d. conditions. revision: partial
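For reference, the standard conformal risk control statement that the exchange above turns on can be written out as follows. The notation is an assumption made for illustration and may differ from the paper's: $L_i(\lambda) \in [0, B]$ is the loss of calibration round $i$ at conservativeness level $\lambda$, non-increasing and right-continuous in $\lambda$, and $\epsilon$ is the target risk level.

```latex
% Assumed notation; see the lead-in above.
\[
\hat{R}_n(\lambda) = \frac{1}{n}\sum_{i=1}^{n} L_i(\lambda),
\qquad
\hat{\lambda} = \inf\Bigl\{\lambda : \frac{n}{n+1}\,\hat{R}_n(\lambda) + \frac{B}{n+1} \le \epsilon\Bigr\}
\]
\[
\mathbb{E}\bigl[L_{n+1}(\hat{\lambda})\bigr] \le \epsilon
\]
```

For abstention, a natural choice is the indicator that a round is both executed and violating, so $B = 1$. The expectation is taken jointly over the calibration rounds and the test round, assuming they are exchangeable; nothing is promised for any single fixed calibration set.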

Circularity Check

0 steps flagged

No significant circularity; guarantees derive from external conformal theory

The paper's central claim applies standard finite-sample conformal risk control (CRC) to nonconformity scores produced by a learned violation predictor, yielding distribution-free guarantees on executed-violation rate under the usual exchangeability assumption. This theoretical guarantee is independent of the specific predictor training procedure and is not reduced to any fitted parameter or self-referential definition within the paper. The reported bootstrap hold rate (86% of splits for the global variant), the minimum per-task hold fraction (0.93 with Mondrian), coverage (78%), and success (70%) are presented as empirical validation that the predictor achieves sufficient calibration on the tested splits, not as the source of the guarantee itself. No self-citations, uniqueness theorems, or ansatzes from prior author work appear as load-bearing steps, and the derivation chain remains self-contained against external conformal prediction results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities. The approach builds on standard conformal prediction, but the learned predictor and the Mondrian variant may introduce unstated modeling choices.

Reference graph

Works this paper leans on

20 extracted references · 15 canonical work pages · 9 internal anchors

[1] Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023.

[2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. Annals of Applied Statistics, 19(2):1641–1662, 2025. arXiv:2110.01052.

[3] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In International Conference on Learning Representations (ICLR), 2024. arXiv:2208.02814.

[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023. arXiv:2307.15818.

[5] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023. arXiv:2303.04137.

[6] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017. arXiv:1705.08500.

[7] Kim et al. Verifier-free test-time sampling for vision language action models. arXiv preprint arXiv:2510.05681, 2025. (MG-Select)

[8] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. In Robotics: Science and Systems, 2025. arXiv:2502.19645.

[9] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[10] Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. RoboMonkey: Scaling test-time sampling and verification for vision-language-action models. In Conference on Robot Learning (CoRL), 2025. arXiv:2506.17811.

[11] Lars Lindemann, Matthew Cleaveland, Gihyun Shim, and George J. Pappas. Safe planning in dynamic environments using conformal prediction. IEEE Robotics and Automation Letters, 2023.

[12] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2023. arXiv:2306.03310.

[13] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning (CoRL), 2024. arXiv:2410.13816.

[14] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Robotics: Science and Systems, 2024. arXiv:2405.12213.

[15] Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, et al. Open X-embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), 2024. arXiv:2310.08864.

[16] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research.

[17] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.

[18] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.

[19] Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alexander Gammerman. Mondrian confidence machine. Technical report, Royal Holloway, University of London, 2003.

[20] Yilin Wu et al. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025. (SEAL)

A Proof of Lemma 1 and Theorem 1

Setup. Let $\mathcal{D}$ be the deployed-decision distribution over tuples $D = (o, \ell, \{a_k\}, \sigma, v, s)$. Let $\{D_1, \dots, D_n\}$ be drawn i.i.d. from $\mathcal{D}$. A bootstrap partition...
