BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
Pith reviewed 2026-06-29 08:05 UTC · model grok-4.3
The pith
BOKBO adds a conformal abstention layer to K-sample VLA inference that gives finite-sample distribution-free guarantees on executed violation rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BOKBO is the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. It supplies both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at epsilon = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86 percent of bootstrap splits with 78 percent coverage and 70 percent net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.
What carries the argument
Conditional conformal risk control (CRC) applied to a learned violation predictor that takes semantic visual features and task identity as inputs, producing an abstention decision for each K-sample inference round.
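A minimal sketch of how a conformal risk control threshold for abstention could be chosen, assuming the predictor emits one scalar violation score per K-sample round and the round is executed only when that score is strictly below the threshold. The function name, the binary loss (executed and violated), and the strict-inequality rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def crc_abstention_threshold(scores, violations, epsilon):
    """Largest threshold lam such that the CRC-adjusted calibration risk <= epsilon.

    Assumed (hypothetical) setup: a round is executed iff score < lam, and the
    loss of a round is 1 when it is executed AND violates, else 0 (so B = 1).
    """
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    n = scores.size
    # With too few calibration points the bound 1/(n+1) alone exceeds epsilon,
    # so the only safe choice is to abstain on every round.
    if 1.0 / (n + 1) > epsilon:
        return -np.inf
    best = -np.inf
    # The empirical risk only changes at observed scores, so it is enough to
    # check those values plus +inf (execute everything).
    for lam in np.append(np.unique(scores), np.inf):
        risk = np.mean(violations * (scores < lam))
        if (n / (n + 1)) * risk + 1.0 / (n + 1) > epsilon:
            break  # risk is non-decreasing in lam, so larger values also fail
        best = lam
    return best

# Toy calibration set: higher scores coincide with violations.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_violations = [0, 0, 0, 0, 0, 0, 1, 1, 1]
lam_hat = crc_abstention_threshold(cal_scores, cal_violations, epsilon=0.25)
print("execute when score <", lam_hat)
```

Note that nine calibration points cannot support epsilon = 0.05 in this sketch: the finite-sample correction 1/(n+1) already exceeds the target, so the function returns minus infinity and the layer abstains everywhere.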
If this is right
- Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.
- Results remain stable across five training seeds, replicate within bootstrap noise on pi0-FAST, hold on libero_spatial_temp_x0.1, and survive four within-suite distribution shifts.
- Under perturbation-based K-sampling, policy-internal nonconformity scores correlate at 0.98 with the action-noise hyperparameter σ but only at the noise floor with actual safety violations.
- Globally set force thresholds below expert-typical manipulation forces inflate reported violation rates by a factor of five.
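The per-task (Mondrian) variant named in the first bullet can be sketched as running the same calibration separately inside each task group. The helper names and the strict-inequality execution rule below are assumptions for illustration; the paper's exact grouping and loss are not given in this summary.

```python
import numpy as np
from collections import defaultdict

def crc_threshold(scores, violations, epsilon):
    """CRC threshold for one group; execute iff score < threshold (assumed rule)."""
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    n = scores.size
    if 1.0 / (n + 1) > epsilon:
        return -np.inf  # too few points in this group: always abstain
    best = -np.inf
    for lam in np.append(np.unique(scores), np.inf):
        risk = np.mean(violations * (scores < lam))
        if (n / (n + 1)) * risk + 1.0 / (n + 1) > epsilon:
            break
        best = lam
    return best

def mondrian_thresholds(task_ids, scores, violations, epsilon):
    """One threshold per task, each calibrated only on that task's rounds."""
    groups = defaultdict(lambda: ([], []))
    for t, s, v in zip(task_ids, scores, violations):
        groups[t][0].append(s)
        groups[t][1].append(v)
    return {t: crc_threshold(s, v, epsilon) for t, (s, v) in groups.items()}

# Toy data: task "hard" has violations at high scores, task "easy" has none.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] * 2
violations = [0, 0, 0, 0, 0, 0, 1, 1, 1] + [0] * 9
task_ids = ["hard"] * 9 + ["easy"] * 9
per_task = mondrian_thresholds(task_ids, scores, violations, epsilon=0.25)
print(per_task)
```

The trade-off is sample size: each task needs enough calibration rounds of its own for the 1/(n+1) correction to fit under epsilon, otherwise that task's threshold collapses to always abstaining.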
Where Pith is reading between the lines
- The same abstention construction could be tested on other multi-sample inference pipelines in robotics to add explicit safety bounds without retraining the base policy.
- The documented failure of internal scores under perturbation sampling points to a general need for external predictors whenever K-sampling is driven by added noise rather than model stochasticity.
- Re-evaluation of earlier VLA safety numbers that used low force thresholds may be required before comparing violation rates across methods.
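The mechanism behind the internal-score failure can be illustrated with a toy simulation. This is a sketch on synthetic data, not a reproduction of the paper's experiment: candidates are formed by adding Gaussian noise of scale sigma to a base action, and violation labels are drawn independently of everything else purely by assumption. All variable names and distributions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rounds, K, action_dim = 500, 8, 7

# Assumed perturbation-based K-sampling: candidate = base action + sigma * noise.
sigma = rng.uniform(0.01, 0.5, size=n_rounds)
base = rng.normal(size=(n_rounds, 1, action_dim))
noise = rng.normal(size=(n_rounds, K, action_dim))
candidates = base + sigma[:, None, None] * noise

# K-sample disagreement: spread across the K candidates, averaged over action dims.
disagreement = candidates.std(axis=1, ddof=1).mean(axis=1)

# Synthetic violation labels, independent of sigma by construction (assumption).
violations = (rng.random(n_rounds) < 0.2).astype(float)

r_sigma = np.corrcoef(disagreement, sigma)[0, 1]
r_violation = np.corrcoef(disagreement, violations)[0, 1]
print(f"corr(disagreement, sigma)     = {r_sigma:.2f}")
print(f"corr(disagreement, violation) = {r_violation:.2f}")
```

In this construction the disagreement score is an estimate of sigma itself, so it carries information about the sampling hyperparameter rather than about the scene, which is one reading of why an external predictor is needed.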
Load-bearing premise
A learned violation predictor conditioned on semantic visual features and task identity can be calibrated tightly enough for the conditional CRC bound to hold on the majority of bootstrap splits.
What would settle it
On a new held-out distribution shift or task suite, the fraction of bootstrap splits where the conditional CRC bound holds at epsilon = 0.05 falls well below 86 percent while coverage remains near 78 percent.
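A hedged sketch of how such a hold fraction could be measured: repeatedly split the logged rounds into calibration and test halves, calibrate a threshold on one half, and record whether the executed-violation rate on the other half stays at or below epsilon. The random-partition scheme, the 50/50 split, and all function names are assumptions; the paper's exact bootstrap protocol is not given in this summary.

```python
import numpy as np

def crc_threshold(scores, violations, epsilon):
    """CRC threshold; a round is executed iff score < threshold (assumed rule)."""
    n = scores.size
    if 1.0 / (n + 1) > epsilon:
        return -np.inf
    best = -np.inf
    for lam in np.append(np.unique(scores), np.inf):
        risk = np.mean(violations * (scores < lam))
        if (n / (n + 1)) * risk + 1.0 / (n + 1) > epsilon:
            break
        best = lam
    return best

def hold_fraction(scores, violations, epsilon, n_splits=200, cal_frac=0.5, seed=0):
    """Return (fraction of splits with test risk <= epsilon, mean coverage)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    violations = np.asarray(violations, dtype=float)
    n = scores.size
    n_cal = int(cal_frac * n)
    holds, coverage = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        cal, test = perm[:n_cal], perm[n_cal:]
        lam = crc_threshold(scores[cal], violations[cal], epsilon)
        executed = scores[test] < lam
        # Executed-violation rate on the held-out half for this split.
        test_risk = np.mean(violations[test] * executed)
        holds.append(test_risk <= epsilon)
        coverage.append(executed.mean())
    return float(np.mean(holds)), float(np.mean(coverage))

# Synthetic example: scores are informative but noisy about violations.
rng = np.random.default_rng(1)
v = (rng.random(400) < 0.3).astype(float)
s = np.clip(0.3 + 0.4 * v + 0.15 * rng.normal(size=400), 0.0, 1.0)
hold, cov = hold_fraction(s, v, epsilon=0.05)
print(f"hold fraction = {hold:.2f}, coverage = {cov:.2f}")
```

Because conformal risk control bounds the risk in expectation over calibration and test draws, individual splits can exceed epsilon, so a hold fraction below 100 percent is not by itself a contradiction of the guarantee.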
Original abstract
Test-time scaling for vision-language-action (VLA) policies, methods such as RoboMonkey, SEAL, MG-Select, and V-GPS, samples K candidate action chunks at inference and executes the verifier-best. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $\sigma$, while correlating at the noise floor with actual safety violations. We test the failure's scope by replicating the analysis under token-level temperature sampling and find the failure is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $\epsilon$ = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $\pi_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally-set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.
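The inference-time flow described in the abstract can be sketched as a single decision function. Every callable here (`sample_candidates`, `verifier_score`, `violation_score`) is a hypothetical stand-in, and passing only the observation and task identity to the violation predictor is an assumption based on the abstract's description, not a confirmed interface.

```python
from typing import Any, Callable, Optional, Sequence

def select_or_abstain(
    obs: Any,
    task_id: str,
    sample_candidates: Callable[[Any, int], Sequence[Any]],
    verifier_score: Callable[[Any, Any], float],
    violation_score: Callable[[Any, str], float],
    threshold: float,
    k: int = 8,
) -> Optional[Any]:
    """Return the verifier-best of K candidates, or None to abstain."""
    candidates = sample_candidates(obs, k)
    # Standard test-time scaling step: keep the candidate the verifier likes most.
    best = max(candidates, key=lambda a: verifier_score(obs, a))
    # Abstention step: a calibrated threshold on the predicted violation score
    # decides whether the selected action chunk is executed at all.
    if violation_score(obs, task_id) < threshold:
        return best
    return None

# Toy usage with stand-in callables (integers play the role of action chunks).
toy_sampler = lambda obs, k: list(range(k))
toy_verifier = lambda obs, a: -abs(a - 3)
action = select_or_abstain(None, "pick", toy_sampler, toy_verifier,
                           lambda obs, task: 0.1, threshold=0.5, k=5)
print(action)
```

Returning `None` makes the abstention explicit to the caller, which can then hand control to a fallback such as a human operator or a safe stop; which fallback the paper uses is not stated in this summary.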
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BOKBO, the first conformal abstention layer for K-sample VLA inference. It provides finite-sample distribution-free guarantees (global and Mondrian/per-task variants) on executed-violation rate by using a learned violation predictor conditioned on semantic visual features and task identity. The work identifies a structural failure mode of policy-internal nonconformity scores under perturbation-based K-sampling (0.98 correlation with noise hyperparameter σ but near-zero with actual violations), replicates the analysis under token-level temperature sampling, corrects a methodological pitfall with globally-set force thresholds that inflate violation rates by 5×, and reports empirical results on libero_object_temp_x0.1 and libero_spatial_temp_x0.1 with OpenVLA-OFT and π0-FAST (conditional CRC bound holds on 86% of bootstrap splits at ε=0.05 with 78% coverage and 70% success; Mondrian raises min per-task hold fraction to 0.93).
Significance. If the central claims hold, BOKBO would be a meaningful addition to test-time scaling methods for VLAs by adding calibrated abstention with explicit finite-sample guarantees on safety violations. The identification and correction of the force-threshold pitfall and the exposure of the nonconformity-score failure mode are concrete contributions. The Mondrian variant addressing per-task gaps is a useful extension of standard conformal methods.
major comments (2)
- [Abstract] Abstract: the central claim of 'finite-sample distribution-free guarantees on executed-violation rate' is load-bearing, yet the reported conditional CRC bound holds on only 86% of bootstrap splits (rising to 0.93 min per-task with Mondrian). This empirical fraction indicates that the guarantee does not hold unconditionally across splits; the manuscript must either provide a theoretical argument showing why the 14% failure rate does not invalidate the distribution-free property or revise the claim to reflect that the guarantee is conditional on the learned predictor's calibration performance.
- [Abstract] Abstract and methods description: the finite-sample guarantee relies on the learned violation predictor (conditioned on semantic visual features and task identity) satisfying the exchangeability and calibration conditions required for both global and Mondrian CRC. The manuscript reports that this predictor 'supports tight calibration' via the 86%/78%/70% empirical figures, but does not demonstrate that the predictor's training introduces no feature-task correlations that would violate the implicit i.i.d. assumption between calibration and test points; a concrete counter-example or sensitivity analysis on this point is needed to support transfer of the distribution-free property to executed actions.
minor comments (2)
- The stability across 5 training seeds and replication on π0-FAST within bootstrap noise is stated but would benefit from explicit quantification of variance in the hold-fraction metric.
- The four within-suite distribution shifts are mentioned as survived but lack a table or figure reference showing per-shift coverage and success numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing theoretical clarification on the marginal nature of the guarantees and agreeing to targeted revisions for improved clarity on assumptions.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim of 'finite-sample distribution-free guarantees on executed-violation rate' is load-bearing, yet the reported conditional CRC bound holds on only 86% of bootstrap splits (rising to 0.93 min per-task with Mondrian). This empirical fraction indicates that the guarantee does not hold unconditionally across splits; the manuscript must either provide a theoretical argument showing why the 14% failure rate does not invalidate the distribution-free property or revise the claim to reflect that the guarantee is conditional on the learned predictor's calibration performance.
Authors: The finite-sample distribution-free guarantees of CRC are marginal: they bound the expected executed-violation rate by ε, with the expectation taken over the random draw of the calibration set and the test point (under exchangeability), and they do not hold conditionally for every fixed calibration set. The 86% bootstrap hold rate is therefore an empirical diagnostic of practical performance rather than a violation of the marginal guarantee. We will revise the abstract to explicitly describe the guarantees as marginal finite-sample distribution-free guarantees, thereby addressing the load-bearing claim while preserving its accuracy. revision: yes
- Referee: [Abstract] Abstract and methods description: the finite-sample guarantee relies on the learned violation predictor (conditioned on semantic visual features and task identity) satisfying the exchangeability and calibration conditions required for both global and Mondrian CRC. The manuscript reports that this predictor 'supports tight calibration' via the 86%/78%/70% empirical figures, but does not demonstrate that the predictor's training introduces no feature-task correlations that would violate the implicit i.i.d. assumption between calibration and test points; a concrete counter-example or sensitivity analysis on this point is needed to support transfer of the distribution-free property to executed actions.
Authors: The violation predictor is trained on a dataset disjoint from the calibration and test sets used for CRC, ensuring that the resulting nonconformity scores satisfy exchangeability between calibration and test points. The feature and task conditioning defines the score but does not break exchangeability provided the data-generating process remains consistent. We will expand the methods section to detail this disjoint training protocol. While a dedicated sensitivity analysis on induced correlations is not included, the standard conformal assumptions suffice for the transfer of the guarantee; we view the requested counter-example as unnecessary under the maintained i.i.d. conditions. revision: partial
Circularity Check
No significant circularity; guarantees derive from external conformal theory
Full rationale
The paper's central claim applies standard finite-sample conformal risk control (CRC) to nonconformity scores produced by a learned violation predictor, yielding distribution-free guarantees on executed-violation rate under the usual exchangeability assumption. This theoretical guarantee is independent of the specific predictor training procedure and is not reduced to any fitted parameter or self-referential definition within the paper. The reported bootstrap hold rates (86% global, 0.93 Mondrian), coverage (78%), and success (70%) are presented as empirical validation that the predictor achieves sufficient calibration on the tested splits, not as the source of the guarantee itself. No self-citations, uniqueness theorems, or ansatzes from prior author work appear as load-bearing steps, and the derivation chain remains self-contained against external conformal prediction results.
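For reference, the standard conformal risk control result that this rationale relies on (reference [3]) can be written as follows. The notation is assumed here and may differ from the paper's: $L_i(\lambda)$ is the loss of point $i$ at threshold $\lambda$, bounded above by $B$, non-increasing and right-continuous in $\lambda$; the $n$ calibration losses and the test loss $L_{n+1}$ are exchangeable; and some $\lambda_{\max}$ achieves loss at most $\epsilon$.

```latex
% Assumed notation (not taken from the paper):
%   L_i(\lambda) \le B, non-increasing and right-continuous in \lambda,
%   (L_1, \dots, L_{n+1}) exchangeable, \epsilon the target risk level.
\hat{R}_n(\lambda) = \frac{1}{n} \sum_{i=1}^{n} L_i(\lambda),
\qquad
\hat{\lambda} = \inf\left\{ \lambda :
  \frac{n}{n+1}\,\hat{R}_n(\lambda) + \frac{B}{n+1} \le \epsilon \right\}
\quad\Longrightarrow\quad
\mathbb{E}\left[ L_{n+1}(\hat{\lambda}) \right] \le \epsilon .
```

For an abstention layer, one natural choice (an assumption here) is the binary loss "the round was executed and violated", which gives $B = 1$. The expectation is over both the calibration set and the test point, which is why the bound is marginal rather than guaranteed for every fixed calibration set.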
Reference graph
Works this paper leans on
[1] Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023.
[2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. Annals of Applied Statistics, 19(2):1641–1662, 2025. arXiv:2110.01052.
[3] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In International Conference on Learning Representations (ICLR), 2024. arXiv:2208.02814.
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023. arXiv:2307.15818.
[5] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023. arXiv:2303.04137.
[6] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017. arXiv:1705.08500.
[7] Kim et al. Verifier-free test-time sampling for vision language action models. arXiv preprint arXiv:2510.05681, 2025. MG-Select.
[8] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. In Robotics: Science and Systems, 2025. arXiv:2502.19645.
[9] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[10] Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. RoboMonkey: Scaling test-time sampling and verification for vision-language-action models. In Conference on Robot Learning (CoRL), 2025. arXiv:2506.17811.
[11] Lars Lindemann, Matthew Cleaveland, Gihyun Shim, and George J. Pappas. Safe planning in dynamic environments using conformal prediction. IEEE Robotics and Automation Letters, 2023.
[12] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2023. arXiv:2306.03310.
[13] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning (CoRL), 2024. arXiv:2410.13816.
[14] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Robotics: Science and Systems, 2024. arXiv:2405.12213.
[15] Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, et al. Open X-embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), 2024. arXiv:2310.08864.
[16] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research,
[17] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
[18] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
[19] Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alexander Gammerman. Mondrian confidence machine. Technical report, Royal Holloway, University of London, 2003.
[20] Yilin Wu et al. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025. SEAL.
A Proof of Lemma 1 and Theorem 1
Setup. Let $\mathcal{D}$ be the deployed-decision distribution over tuples $D = (o, \ell, \{a_k\}, \sigma, v, s)$. Let $\{D_1, \ldots, D_n\}$ be drawn i.i.d. from $\mathcal{D}$. A bootstrap partition...