
arxiv: 2605.30660 · v1 · pith:HVR2LWARnew · submitted 2026-05-28 · 💻 cs.LG · cs.RO

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

Pith reviewed 2026-06-29 08:05 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords conformal abstention · VLA policies · test-time scaling · violation rate guarantees · Mondrian conformal prediction · vision language action · calibrated abstention · distribution free guarantees

The pith

BOKBO adds a conformal abstention layer to K-sample VLA inference that gives finite-sample distribution-free guarantees on executed violation rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BOKBO to address a gap in current test-time scaling for vision-language-action policies: these methods sample K candidate actions and execute the best one, so when every sample is unsafe they still execute a violating action. BOKBO inserts a calibrated abstention step that decides whether to skip execution, backed by statistical bounds that limit the long-run rate of executed violations. Both a single global threshold and a per-task Mondrian version are provided, the latter improving reliability on the hardest tasks. The work also shows that common policy-internal scores fail to track actual safety violations under perturbation-based sampling, while a learned predictor on visual features and task identity can be calibrated tightly.
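The abstention step described above can be pictured with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `select_or_abstain` function, and the convention that a single per-round `violation_score` is compared against a calibrated `threshold` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    """One sampled action chunk (hypothetical representation)."""
    action: Sequence[float]
    verifier_score: float  # higher means the verifier prefers this candidate


def select_or_abstain(
    candidates: Sequence[Candidate],
    violation_score: float,
    threshold: float,
) -> Optional[Candidate]:
    """Return the verifier-best candidate, or None to abstain.

    violation_score is assumed to come from a learned violation predictor;
    threshold is assumed to come from a separate calibration step.
    """
    if not candidates:
        return None  # nothing to execute
    if violation_score > threshold:
        return None  # predicted risk too high: skip execution this round
    return max(candidates, key=lambda c: c.verifier_score)
```

Plain best-of-K selection corresponds to the last line alone; the two early returns are what turn "execute the best of K" into "execute the best of K, unless the round looks unsafe".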

Core claim

BOKBO is the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. It supplies both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at epsilon = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86 percent of bootstrap splits with 78 percent coverage and 70 percent net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.

What carries the argument

Conditional conformal risk control (CRC) applied to a learned violation predictor that takes semantic visual features and task identity as inputs, producing an abstention decision for each K-sample inference round.
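A rough sketch of how a threshold could be calibrated with standard conformal risk control follows. It assumes a 0/1 loss that counts a round as bad when it is executed (score at or below the threshold) and violates, and it controls that joint rate; the paper's conditional variant may be defined differently. The function name `calibrate_threshold` and its inputs are hypothetical.

```python
from typing import Optional, Sequence


def calibrate_threshold(
    scores: Sequence[float],
    violated: Sequence[bool],
    epsilon: float,
) -> Optional[float]:
    """Largest calibration score tau with (n * R_hat(tau) + 1) / (n + 1) <= epsilon.

    R_hat(tau) is the fraction of calibration rounds that would be executed
    (score <= tau) and that violated. Returns None when no threshold
    qualifies, which means "always abstain".
    """
    n = len(scores)
    pairs = sorted(zip(scores, violated))
    best: Optional[float] = None
    executed_violations = 0
    for idx, (tau, bad) in enumerate(pairs):
        executed_violations += int(bad)
        # Evaluate only at the last element of a run of tied scores.
        if idx + 1 < n and pairs[idx + 1][0] == tau:
            continue
        if (executed_violations + 1) / (n + 1) > epsilon:
            break  # the loss is monotone in tau, so larger thresholds also fail
        best = tau
    return best
```

The `+ 1` in the numerator is the usual finite-sample correction for a loss bounded by 1; with very small calibration sets it alone can exceed epsilon, in which case the sketch returns None.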

If this is right

  • Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.
  • Results remain stable across five training seeds, replicate within bootstrap noise on pi0-FAST, hold on libero_spatial_temp_x0.1, and survive four within-suite distribution shifts.
  • Policy-internal nonconformity scores correlate at 0.98 with the action-noise hyperparameter rather than with actual safety violations under perturbation-based K-sampling.
  • Globally set force thresholds below expert-typical manipulation forces inflate reported violation rates by a factor of five.
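The per-task (Mondrian) idea in the list above can be sketched by running the same calibration separately inside each task group. This is an illustrative sketch under the same assumptions as a joint-rate CRC threshold, not the paper's procedure; `crc_threshold` and `mondrian_thresholds` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def crc_threshold(scores, violated, epsilon) -> Optional[float]:
    """Largest score tau whose corrected executed-violation rate is <= epsilon."""
    n = len(scores)
    pairs = sorted(zip(scores, violated))
    best, bad = None, 0
    for idx, (tau, v) in enumerate(pairs):
        bad += int(v)
        if idx + 1 < n and pairs[idx + 1][0] == tau:
            continue  # wait for the end of a run of tied scores
        if (bad + 1) / (n + 1) > epsilon:
            break
        best = tau
    return best


def mondrian_thresholds(
    task_ids: Sequence[Hashable],
    scores: Sequence[float],
    violated: Sequence[bool],
    epsilon: float,
) -> Dict[Hashable, Optional[float]]:
    """Calibrate one threshold per task, using only that task's calibration rounds."""
    groups: Dict[Hashable, Tuple[List[float], List[bool]]] = defaultdict(
        lambda: ([], [])
    )
    for task, score, bad in zip(task_ids, scores, violated):
        groups[task][0].append(score)
        groups[task][1].append(bad)
    return {t: crc_threshold(s, v, epsilon) for t, (s, v) in groups.items()}
```

One design consequence worth noting: each task pays its own finite-sample correction, so a task with few calibration rounds can end up with no admissible threshold even when it has no observed violations.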

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same abstention construction could be tested on other multi-sample inference pipelines in robotics to add explicit safety bounds without retraining the base policy.
  • The documented failure of internal scores under perturbation sampling points to a general need for external predictors whenever K-sampling is driven by added noise rather than model stochasticity.
  • Re-evaluation of earlier VLA safety numbers that used low force thresholds may be required before comparing violation rates across methods.
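The concern about internal scores tracking the noise hyperparameter rather than safety suggests a simple diagnostic: correlate a candidate score with the sampling noise level and, separately, with observed violations. The sketch below uses plain Pearson correlation on toy inputs; the function names are hypothetical and the numbers in the usage note are synthetic, not the paper's data.

```python
import math
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def score_diagnostic(
    scores: Sequence[float],
    sigmas: Sequence[float],
    violations: Sequence[int],
) -> Tuple[float, float]:
    """Return (corr(score, sigma), corr(score, violation)).

    A score that is useful for abstention should show a clearly nonzero
    second value; a first value near 1 with a second value near 0 indicates
    the score mostly mirrors the sampling noise setting.
    """
    return pearson(scores, sigmas), pearson(scores, violations)
```

For example, with synthetic inputs `sigmas = [0.1, 0.1, 0.2, 0.2]`, `scores = [2 * s for s in sigmas]`, and `violations = [0, 1, 0, 1]`, the score is a rescaled copy of the noise level and carries no information about violations.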

Load-bearing premise

A learned violation predictor conditioned on semantic visual features and task identity can be calibrated tightly enough for the conditional CRC bound to hold on the majority of bootstrap splits.

What would settle it

On a new held-out distribution shift or task suite, the fraction of bootstrap splits where the conditional CRC bound holds at epsilon = 0.05 falls well below 86 percent while coverage remains near 78 percent.
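A check of this kind could be run roughly as follows. The sketch repeatedly splits logged rounds into calibration and test halves, calibrates a threshold on one half, and records whether the violation rate among executed test rounds stays at or below epsilon. The exact bootstrap protocol of the paper is not given in this page, so the random half-split, the function names, and the treatment of "nothing executed" as a hold are assumptions.

```python
import random
from typing import Optional, Sequence, Tuple


def crc_threshold(scores, violated, epsilon) -> Optional[float]:
    """Largest score tau whose corrected executed-violation rate is <= epsilon."""
    n = len(scores)
    pairs = sorted(zip(scores, violated))
    best, bad = None, 0
    for idx, (tau, v) in enumerate(pairs):
        bad += int(v)
        if idx + 1 < n and pairs[idx + 1][0] == tau:
            continue  # wait for the end of a run of tied scores
        if (bad + 1) / (n + 1) > epsilon:
            break
        best = tau
    return best


def hold_fraction(
    scores: Sequence[float],
    violated: Sequence[bool],
    epsilon: float,
    n_splits: int = 200,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (fraction of splits where the bound holds, mean test coverage)."""
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    half = len(idx) // 2
    holds, coverage_sum = 0, 0.0
    for _ in range(n_splits):
        rng.shuffle(idx)
        cal, test = idx[:half], idx[half:]
        tau = crc_threshold(
            [scores[i] for i in cal], [violated[i] for i in cal], epsilon
        )
        executed = [i for i in test if tau is not None and scores[i] <= tau]
        bad = sum(1 for i in executed if violated[i])
        # With no executed rounds the conditional rate is vacuous; count as a hold.
        rate = bad / len(executed) if executed else 0.0
        if rate <= epsilon:
            holds += 1
        coverage_sum += len(executed) / len(test)
    return holds / n_splits, coverage_sum / n_splits
```

Reporting coverage alongside the hold fraction matters because a threshold that abstains on everything holds trivially.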

Original abstract

Test-time scaling methods for vision-language-action (VLA) policies, such as RoboMonkey, SEAL, MG-Select, and V-GPS, sample K candidate action chunks at inference and execute the verifier-best one. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $\sigma$, while correlating at the noise floor with actual safety violations. We test the failure's scope by replicating the analysis under token-level temperature sampling and find the failure is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $\epsilon = 0.05$ on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $\pi_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes BOKBO, the first conformal abstention layer for K-sample VLA inference. It provides finite-sample distribution-free guarantees (global and Mondrian/per-task variants) on executed-violation rate by using a learned violation predictor conditioned on semantic visual features and task identity. The work identifies a structural failure mode of policy-internal nonconformity scores under perturbation-based K-sampling (0.98 correlation with noise hyperparameter σ but near-zero with actual violations), replicates the analysis under token-level temperature sampling, corrects a methodological pitfall with globally-set force thresholds that inflate violation rates by 5×, and reports empirical results on libero_object_temp_x0.1 and libero_spatial_temp_x0.1 with OpenVLA-OFT and π0-FAST (conditional CRC bound holds on 86% of bootstrap splits at ε=0.05 with 78% coverage and 70% success; Mondrian raises min per-task hold fraction to 0.93).

Significance. If the central claims hold, BOKBO would be a meaningful addition to test-time scaling methods for VLAs by adding calibrated abstention with explicit finite-sample guarantees on safety violations. The identification and correction of the force-threshold pitfall and the exposure of the nonconformity-score failure mode are concrete contributions. The Mondrian variant addressing per-task gaps is a useful extension of standard conformal methods.

major comments (2)
  1. [Abstract] Abstract: the central claim of 'finite-sample distribution-free guarantees on executed-violation rate' is load-bearing, yet the reported conditional CRC bound holds on only 86% of bootstrap splits (rising to 0.93 min per-task with Mondrian). This empirical fraction indicates that the guarantee does not hold unconditionally across splits; the manuscript must either provide a theoretical argument showing why the 14% failure rate does not invalidate the distribution-free property or revise the claim to reflect that the guarantee is conditional on the learned predictor's calibration performance.
  2. [Abstract] Abstract and methods description: the finite-sample guarantee relies on the learned violation predictor (conditioned on semantic visual features and task identity) satisfying the exchangeability and calibration conditions required for both global and Mondrian CRC. The manuscript reports that this predictor 'supports tight calibration' via the 86%/78%/70% empirical figures, but does not demonstrate that the predictor's training introduces no feature-task correlations that would violate the implicit i.i.d. assumption between calibration and test points; a concrete counter-example or sensitivity analysis on this point is needed to support transfer of the distribution-free property to executed actions.
minor comments (2)
  1. The stability across 5 training seeds and replication on π0-FAST within bootstrap noise is stated but would benefit from explicit quantification of variance in the hold-fraction metric.
  2. The four within-suite distribution shifts are mentioned as survived but lack a table or figure reference showing per-shift coverage and success numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing theoretical clarification on the marginal nature of the guarantees and agreeing to targeted revisions for improved clarity on assumptions.

  1. Referee: [Abstract] Abstract: the central claim of 'finite-sample distribution-free guarantees on executed-violation rate' is load-bearing, yet the reported conditional CRC bound holds on only 86% of bootstrap splits (rising to 0.93 min per-task with Mondrian). This empirical fraction indicates that the guarantee does not hold unconditionally across splits; the manuscript must either provide a theoretical argument showing why the 14% failure rate does not invalidate the distribution-free property or revise the claim to reflect that the guarantee is conditional on the learned predictor's calibration performance.

    Authors: The finite-sample distribution-free guarantee of CRC is marginal: it bounds the expected loss, averaged over the random draw of the calibration set and the test point (under exchangeability), and does not hold conditionally for every fixed calibration set. The 86% bootstrap hold rate is therefore an empirical diagnostic of per-split behavior rather than a violation of the marginal guarantee. We will revise the abstract to describe the guarantees explicitly as marginal finite-sample distribution-free guarantees, which keeps the load-bearing claim accurate. revision: yes

  2. Referee: [Abstract] Abstract and methods description: the finite-sample guarantee relies on the learned violation predictor (conditioned on semantic visual features and task identity) satisfying the exchangeability and calibration conditions required for both global and Mondrian CRC. The manuscript reports that this predictor 'supports tight calibration' via the 86%/78%/70% empirical figures, but does not demonstrate that the predictor's training introduces no feature-task correlations that would violate the implicit i.i.d. assumption between calibration and test points; a concrete counter-example or sensitivity analysis on this point is needed to support transfer of the distribution-free property to executed actions.

    Authors: The violation predictor is trained on a dataset disjoint from the calibration and test sets used for CRC, ensuring that the resulting nonconformity scores satisfy exchangeability between calibration and test points. The feature and task conditioning defines the score but does not break exchangeability provided the data-generating process remains consistent. We will expand the methods section to detail this disjoint training protocol. While a dedicated sensitivity analysis on induced correlations is not included, the standard conformal assumptions suffice for the transfer of the guarantee; we view the requested counter-example as unnecessary under the maintained i.i.d. conditions. revision: partial
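For reference, the marginal guarantee at issue in the first response can be written out. The statement below is the standard result from the conformal risk control paper listed in the reference graph (Angelopoulos et al., ICLR 2024), in that paper's generic notation; identifying its target level $\alpha$ with the $\epsilon = 0.05$ used on this page, and its loss with an executed-violation indicator, is an assumption about how BOKBO instantiates it.

```latex
% Setting: exchangeable random loss functions L_1, ..., L_{n+1}, each
% non-increasing and right-continuous in \lambda, with values in (-\infty, B].
\hat{R}_n(\lambda) = \frac{1}{n} \sum_{i=1}^{n} L_i(\lambda),
\qquad
\hat{\lambda} = \inf\left\{ \lambda :
  \frac{n}{n+1}\,\hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha \right\}

% Guarantee (marginal over calibration data and the test point):
\mathbb{E}\left[ L_{n+1}(\hat{\lambda}) \right] \le \alpha
```

The expectation averages over calibration sets, so an individual calibration split can exceed $\alpha$ on its own test data without contradicting the bound.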

Circularity Check

0 steps flagged

No significant circularity; guarantees derive from external conformal theory


The paper's central claim applies standard finite-sample conformal risk control (CRC) to nonconformity scores produced by a learned violation predictor, yielding distribution-free guarantees on executed-violation rate under the usual exchangeability assumption. This theoretical guarantee is independent of the specific predictor training procedure and is not reduced to any fitted parameter or self-referential definition within the paper. The reported bootstrap hold rates (86% global, 0.93 Mondrian), coverage (78%), and success (70%) are presented as empirical validation that the predictor achieves sufficient calibration on the tested splits, not as the source of the guarantee itself. No self-citations, uniqueness theorems, or ansatzes from prior author work appear as load-bearing steps, and the derivation chain remains self-contained against external conformal prediction results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the approach builds on standard conformal prediction but the learned predictor and Mondrian variant may introduce unstated modeling choices.



Reference graph

Works this paper leans on

20 extracted references · 15 canonical work pages · 9 internal anchors

  1. Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023.

  2. Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. Annals of Applied Statistics, 19(2):1641–1662, 2025. arXiv:2110.01052

  3. Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In International Conference on Learning Representations (ICLR), 2024. arXiv:2208.02814

  4. Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023. arXiv:2307.15818

  5. Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023. arXiv:2303.04137

  6. Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017. arXiv:1705.08500

  7. Kim et al. Verifier-free test-time sampling for vision language action models. arXiv preprint arXiv:2510.05681, 2025. (MG-Select)

  8. Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. In Robotics: Science and Systems, 2025. arXiv:2502.19645

  9. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  10. Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. RoboMonkey: Scaling test-time sampling and verification for vision-language-action models. In Conference on Robot Learning (CoRL), 2025. arXiv:2506.17811

  11. Lars Lindemann, Matthew Cleaveland, Gihyun Shim, and George J. Pappas. Safe planning in dynamic environments using conformal prediction. IEEE Robotics and Automation Letters, 2023

  12. Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2023. arXiv:2306.03310

  13. Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning (CoRL), 2024. arXiv:2410.13816

  14. Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Robotics: Science and Systems, 2024. arXiv:2405.12213

  15. Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, et al. Open X-embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), 2024. arXiv:2310.08864

  16. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research,

  17. Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  18. Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005

  19. Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alexander Gammerman. Mondrian confidence machine. Technical report, Royal Holloway, University of London, 2003
  20. Yilin Wu et al. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025. (SEAL)

A Proof of Lemma 1 and Theorem 1

Setup. Let $\mathcal{D}$ be the deployed-decision distribution over tuples $D = (o, \ell, \{a_k\}, \sigma, v, s)$. Let $\{D_1, \ldots, D_n\}$ be drawn i.i.d. from $\mathcal{D}$. A bootstrap partition...