pith. machine review for the scientific record.

arxiv: 2602.22474 · v2 · submitted 2026-02-25 · 💻 cs.RO · cs.LG

Recognition: no theorem link

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:57 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords policy steering · conformal prediction · vision-language models · robot learning · uncertainty quantification · human-robot interaction · continual learning · deployment adaptation

The pith

A robot policy can decide to act, query for clarification, or request human intervention by calibrating its uncertainty estimates with conformal prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes uncertainty-aware policy steering (UPS) to handle both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability in a pre-trained policy. A VLM-based verifier is composed with the base policy, and conformal prediction calibrates their joint outputs so the system can select among three strategies: execute a high-confidence action, clarify task ambiguity through natural-language queries, or ask for human action interventions. This selection comes with statistical assurances; residual learning on the collected interventions then improves the base policy over time. Experiments show that the approach correctly disentangles confident, ambiguous, and incapable scenarios while reducing the number of expensive user interventions relative to uncalibrated baselines.
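The calibrate-then-route step can be sketched with split conformal prediction. This is a minimal illustration, not the paper's exact formulation: the nonconformity-score interface, the singleton/empty-set routing rule, and all function names are assumptions.

```python
import numpy as np

def conformal_quantile(cal_scores, eps=0.15):
    """Split-conformal quantile q_hat at target coverage 1 - eps.

    cal_scores: nonconformity scores (higher = less conforming) computed
    on a held-out calibration set of (instruction, outcome) pairs.
    """
    n = len(cal_scores)
    level = np.ceil((n + 1) * (1 - eps)) / n
    return np.quantile(cal_scores, min(level, 1.0), method="higher")

def select_strategy(option_scores, q_hat):
    """Route to execute / query / intervene from a conformal prediction set.

    option_scores: one nonconformity score per candidate behavior mode
    (hypothetical interface; the paper's score function differs in detail).
    """
    pred_set = [i for i, s in enumerate(option_scores) if s <= q_hat]
    if len(pred_set) == 1:
        return "execute", pred_set   # exactly one confident option: act
    if len(pred_set) > 1:
        return "query", pred_set     # semantic ambiguity: ask the user
    return "intervene", pred_set     # no option conforms: policy incapable
```

Under exchangeability, the set built from `q_hat` contains the correct option with probability at least 1 − eps, which is what grounds the "statistical assurances" on routing.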

Core claim

Uncertainty-aware policy steering (UPS) jointly reasons about semantic task uncertainty and low-level action feasibility, selecting among three resolution strategies: execute, query via natural language, or request an intervention. Conformal prediction calibrates the composition of the VLM verifier and the pre-trained policy to provide statistical assurances of correct strategy selection, and residual learning from collected interventions enables continual improvement with minimal human feedback.
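The residual-learning half of the claim can be illustrated with a deliberately simple stand-in: the base policy stays frozen and only a correction term is fit to intervention data. The paper trains a residual network; the linear ridge solver, array shapes, and function names below are assumptions made for a self-contained sketch.

```python
import numpy as np

def fit_residual(obs, base_actions, expert_actions, reg=1e-3):
    """Fit a linear residual correction from intervention data.

    The base policy is frozen; only the residual (here ridge-regressed,
    in the paper a learned network) is trained on (observation, expert
    action) pairs collected when the robot asked for interventions.
    """
    X = np.hstack([obs, np.ones((len(obs), 1))])   # features plus bias
    Y = expert_actions - base_actions              # what the base got wrong
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W

def corrected_action(obs_row, base_action, W):
    """Deployment-time action: frozen base output plus learned residual."""
    x = np.append(obs_row, 1.0)
    return base_action + x @ W
```

The design point this captures is that interventions are expensive, so they train only a small correction rather than re-training the full policy.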

What carries the argument

Uncertainty-aware policy steering (UPS), which composes a VLM verifier with a pre-trained base policy and applies conformal prediction to their joint outputs to select among execute, query, or intervene strategies.

Load-bearing premise

Conformal prediction applied to the composition of the VLM verifier and pre-trained policy produces valid statistical guarantees for strategy selection in real robotic deployment.

What would settle it

In hardware or simulation trials, the observed rate of incorrect strategy selections (e.g., executing when the task is ambiguous, querying when the task is unambiguous and the policy capable, or requesting intervention when the policy is capable) exceeds the error bound guaranteed by the conformal prediction calibration.
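This falsification criterion can be made operational with a one-sided exact binomial test against the conformal bound. The audit framing (n trials, bound ε, significance level) is an assumption about how such a check would be run, not a procedure from the paper.

```python
import math

def strategy_error_rate(selected, correct):
    """Fraction of trials where the chosen strategy differs from the
    ground-truth correct strategy (execute / query / intervene)."""
    return sum(s != c for s, c in zip(selected, correct)) / len(selected)

def exceeds_conformal_bound(n_errors, n_trials, eps, alpha=0.05):
    """One-sided exact binomial test: P[X >= n_errors], X ~ Bin(n, eps).

    A small p-value means the observed miscoverage is statistically above
    the bound eps that conformal calibration guarantees, i.e. the
    load-bearing premise would be falsified at level alpha.
    """
    p = sum(math.comb(n_trials, k) * eps**k * (1 - eps) ** (n_trials - k)
            for k in range(n_errors, n_trials + 1))
    return p, p < alpha
```

For example, 30 routing errors in 100 trials against a bound of ε = 0.15 would falsify the premise, while 15 errors in 100 trials is consistent with it.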

Figures

Figures reproduced from arXiv: 2602.22474 by Andrea Bajcsy, Jessie Yuan, Yilin Wu.

Figure 1. Uncertainty-Aware Policy Steering. Our framework calibrates the VLM verifier used for policy steering via conformal prediction. This enables the VLM to select an appropriate way to resolve uncertainty, from querying the end-user in natural language to asking to re-train the low-level control policy.
Figure 2. Outcome Prediction & Narration. The policy and the world model are interleaved to predict long-horizon outcomes induced by the low-level policy. Decoded observations are fed into a VLM which narrates the outcomes in text.
Figure 4. Uncertainty Quantification Results: Simulation. We compare the combination of Vanilla, CoT and Bayesian Intent (Ours) models for UQ. Dashed lines are either the target coverage (1 − ε = 0.85), clarification rate, or set size.
Figure 3. Uncertainty Quantification Results: Hardware. Combination of Vanilla, CoT and Bayesian Intent (Ours) models for UQ. Dashed lines are either the target coverage rate (1 − ε = 0.85), clarification rate, or set size.
Figure 6. Hardware: Robot Asking for Interventions. In EnsembleDAgger (top), the human is asked for demonstrations whenever the disagreement across an ensemble of base policies exceeds a threshold. In Human-Gated (HG) DAgger (middle), a human monitors the entire trajectory and intervenes whenever the robot's behavior deviates from their intention. Our approach (bottom two rows) enables the robot to ask for "cheap"…
Figure 5. Success Rates Pre- and Post-Continual Learning: Hardware and Simulation. We deploy the robot with 20 straightforward (left) and 20 ambiguous (right) task instructions. We average the success rate over 20 trials for each scenario. Our approach solicits data in a way which maximizes the final success rate after residual policy training, compared to human- and robot-gated baselines.
Figure 7. Qualitative Results of Intervention & Re-deployment. On the left, we demonstrate three different strategies to elicit human interventions: HG-DAgger (top), EnsembleDAgger (middle) and UPS (Ours) (bottom). On the right, we demonstrate that our method maintains the multi-modality to be able to place the cup in the left or right bin while other methods suffer from failures and lack of multi-modality.
Figure 8. Hardware Setup. We demonstrate our hardware environment setup with a Franka Emika Panda arm and two Zed cameras (left image). In the middle, we show the left-view image captured by a Zed 2i camera and on the right, we show the wrist-view image captured by a Zed M camera.
Figure 9. Hardware Examples of the Task. We show two different ways of achieving the task of placing the cup in the bin. The top row shows the cup placed in the left bin while the bottom row shows the cup placed in the right bin.
Figure 10. Simulation Setup. We demonstrate our simulated task setup using a Franka Emika Panda arm. The leftmost two images show the initial environment setup, while the right four images show the target configurations for the two behavior modes, as labeled. In each image pair, the left image provides the third-person camera view and the right image provides the wrist view.
Figure 11. Hardware Histogram for Empirical Quantile. We list the overall histogram with all three categories combined, as well as per-category histograms. The red vertical line indicates the q̂ value based on user-desired coverage rate 1 − ε.
Figure 12. Simulation Histogram for Empirical Quantile. We list the overall histogram with all three categories combined, as well as per-category histograms. The red vertical line indicates the q̂ value based on user-desired coverage rate 1 − ε.
Figure 13. Hardware Uncertainty Quantification with Logits. We conduct an ablation study where we directly use the softmax over the logits of the first generated token as the score rather than the self-generated score.
Figure 15. Hardware Prompt for Behavior Translation (Grasp).
Figure 16. Hardware Prompt for Behavior Translation (Place).
Figure 20. Hardware Prompt for Updating Instructions.
Figure 21. Hardware Prompt for Clarification Questions.
Figure 18. Hardware Prompt for Generating Possible Intents.
Figure 22. Hardware Prompt for Generating Behavior Probability Conditioned on the Intent.
Figure 23. Hardware Prompt for Asking VLM to Analyze Ambiguity First in CoT Reasoning.
Figure 26. Simulation Prompt for Behavior Translation (Grasp).
Figure 29. Simulation Prompt for Generating Possible Intents.
Figure 28. Simulation Prompt for VLM Verification in Forewarn.
Figure 32. Simulation Prompt for Asking VLM to Score Options Based on Previous Reasoning in CoT Reasoning.
Figure 31. Simulation Prompt for Asking VLM to Analyze Ambiguity First in CoT Reasoning.
Figure 33. Simulation Prompt for Generating Behavior Probability Conditioned on the Intent.
Figure 34. Simulation Prompt for Directly Generating Scores for Available Options.
original abstract

Policy steering is an emerging way to adapt robot behaviors at deployment-time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, the overconfident judgment from VLM can degrade the steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable at the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimizes expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at https://jessie-yuan.github.io/ups/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Uncertainty-Aware Policy Steering (UPS), a framework that combines a VLM-based verifier with conformal prediction to jointly handle semantic task uncertainty and low-level action feasibility in pre-trained robot policies. The verifier selects among three strategies—execute a high-confidence action, issue a natural-language clarification query, or request a human intervention for residual learning—while claiming statistical coverage guarantees on correct strategy selection. After interventions, residual learning updates the base policy for continual improvement with limited human feedback. Validation is provided via simulation and hardware experiments showing reduced interventions relative to uncalibrated and prior gated baselines.

Significance. If the coverage guarantees remain valid, the work offers a principled mechanism for safe deployment-time adaptation of diffusion-style policies under both high-level ambiguity and low-level incapability, reducing expensive human oversight. The explicit use of conformal prediction on the VLM-policy composition and the integration of residual learning constitute concrete technical contributions that could support more reliable continual-learning robot systems.

major comments (2)
  1. [§4.2] §4.2 (Conformal Calibration): The calibration procedure is performed once on the initial pre-trained policy distribution. Residual learning then updates the policy from collected interventions, altering the joint (VLM, policy) output distribution and violating the exchangeability assumption required for marginal coverage guarantees on strategy selection. No re-calibration step or non-stationary conformal variant is described.
  2. [§5.1] §5.1 (Experimental Results): The reported success rates, intervention counts, and comparison tables lack per-condition standard deviations, trial counts, or statistical significance tests. Without these, it is impossible to determine whether the claimed reductions in user interventions are reliable or merely consistent with noise.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'statistical assurances' should explicitly state the target coverage level (e.g., 1−α = 0.9) and the precise definition of the nonconformity score used for the three-way decision.
  2. [Figure 2] Figure 2: The pipeline diagram would be clearer if the conformal threshold computation and the residual-learning update loop were shown as separate sub-panels with explicit arrows indicating when re-calibration would (or would not) occur.
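The statistical reporting requested in major comment 2 can be illustrated concretely. This is a generic sketch of the kind of test a revision might report, not the authors' analysis; the metric names and trial pairing are assumptions.

```python
import math
import statistics as stats

def paired_t_statistic(metric_a, metric_b):
    """Paired t statistic (and degrees of freedom) over matched trials,
    e.g. intervention counts for UPS vs. a baseline on the same 20 task
    instructions. Compare |t| against the t-distribution critical value
    for the chosen significance level to obtain a p-value.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    t = stats.mean(diffs) / (stats.stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

Reporting per-condition means with standard deviations alongside such a paired statistic would settle whether the claimed intervention reductions exceed trial-to-trial noise.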

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the conformal calibration procedure and the experimental reporting. We address each major comment point by point below and outline the revisions planned for the next manuscript version.

point-by-point responses
  1. Referee: [§4.2] §4.2 (Conformal Calibration): The calibration procedure is performed once on the initial pre-trained policy distribution. Residual learning then updates the policy from collected interventions, altering the joint (VLM, policy) output distribution and violating the exchangeability assumption required for marginal coverage guarantees on strategy selection. No re-calibration step or non-stationary conformal variant is described.

    Authors: We agree that residual learning from interventions induces a distribution shift that can violate the exchangeability assumption underlying the initial conformal calibration. The coverage guarantees therefore apply strictly to the pre-trained policy at calibration time. To strengthen the framework for continual deployment, we will revise §4.2 to explicitly discuss this limitation and introduce a periodic re-calibration step that re-computes conformal thresholds on batches of collected intervention data. This addition will be described in an updated §4.2 and §4.3, preserving the minimal-feedback goal while restoring marginal coverage guarantees after policy updates. revision: yes

  2. Referee: [§5.1] §5.1 (Experimental Results): The reported success rates, intervention counts, and comparison tables lack per-condition standard deviations, trial counts, or statistical significance tests. Without these, it is impossible to determine whether the claimed reductions in user interventions are reliable or merely consistent with noise.

    Authors: We concur that the current experimental presentation would be strengthened by rigorous statistical reporting. In the revised manuscript we will add the number of trials per condition, per-condition standard deviations for all metrics (success rate, intervention count, query count), and results of appropriate statistical tests (e.g., paired t-tests with p-values) comparing UPS against the baselines. These details will be incorporated into §5.1, the tables, and the figure captions. revision: yes
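The periodic re-calibration proposed in response 1 can be sketched as a threshold that is recomputed over a recent window of scores collected from the current (post-update) policy, restoring exchangeability within each deployment phase. The window size, class name, and interface are assumptions, not the authors' specification.

```python
from collections import deque
import numpy as np

class RecalibratingThreshold:
    """Conformal threshold that is refreshed from recent scores.

    After each residual-learning update, nonconformity scores are drawn
    from the *updated* policy; recomputing the quantile over a sliding
    window of such scores keeps calibration and deployment distributions
    matched within each phase.
    """

    def __init__(self, eps=0.15, window=200):
        self.eps = eps
        self.scores = deque(maxlen=window)   # most recent calibration scores

    def observe(self, score):
        self.scores.append(score)

    def threshold(self):
        n = len(self.scores)
        if n == 0:
            return float("inf")              # no data yet: always defer
        level = min(np.ceil((n + 1) * (1 - self.eps)) / n, 1.0)
        return float(np.quantile(list(self.scores), level, method="higher"))
```

The trade-off is that each re-calibration consumes a batch of labeled outcomes, which cuts against the minimal-feedback goal; a bounded window keeps that cost fixed per phase.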

Circularity Check

0 steps flagged

No circularity; framework applies established conformal prediction and residual learning without self-referential reduction

full rationale

The derivation chain in the paper applies conformal prediction to calibrate the joint VLM and pre-trained policy outputs for strategy selection (act/ask/learn), then uses residual learning on collected interventions to update the base policy. These steps rely on standard properties of conformal prediction (marginal coverage under exchangeability) and residual learning techniques from the literature, without defining the statistical assurances or strategy selection as equivalent to the fitted inputs by construction. No equations reduce the central claim to a self-definition, no predictions are statistically forced from a subset fit, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The framework remains checkable against external benchmarks for calibration and continual improvement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Framework rests on standard assumptions from conformal prediction for calibration guarantees and residual learning for policy updates; no new free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Conformal prediction provides valid statistical guarantees when applied to the composition of VLM verifier outputs and pre-trained policy feasibility.
    Invoked to ensure the verifier selects the correct uncertainty resolution strategy with statistical assurances.

pith-pipeline@v0.9.0 · 5580 in / 1370 out tokens · 30424 ms · 2026-05-15T18:57:37.017059+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · 5 internal anchors

  1. [1]

    Let’s think in two steps: Mitigating agreement bias in mllms with self- grounded verification

    Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, and Zsolt Kira. Let’s think in two steps: Mitigating agreement bias in mllms with self- grounded verification. InInternational Conference on Learning Representations (ICLR), 2026

  2. [2]

    Con- formal prediction: A gentle introduction.Foundations and trends® in machine learning, 16(4):494–591, 2023

    Anastasios N Angelopoulos, Stephen Bates, et al. Con- formal prediction: A gentle introduction.Foundations and trends® in machine learning, 16(4):494–591, 2023

  3. [3]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

  4. [4]

    Goal inference as inverse planning

    Chris L Baker, Joshua B Tenenbaum, and Rebecca R Saxe. Goal inference as inverse planning. InProceedings of the annual meeting of the cognitive science society, volume 29, 2007

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0 : A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  6. [6]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024

  7. [7]

    Dropout as a bayesian approximation: Representing model uncertainty in deep learning

    Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd Interna- tional Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050– 1059, New York, New York, USA, 20–22 Jun ...

  8. [8]

    Mastering diverse domains through world models.Nature, 2023

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timo- thy Lillicrap. Mastering diverse domains through world models.Nature, 2023

  9. [9]

    Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning

    Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S Brown, and Ken Goldberg. Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning. In5th Annual Conference on Robot Learning, 2021

  10. [10]

    Transic: Sim-to-real policy transfer by learning from online correction

    Yunfan Jiang, Chen Wang, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Transic: Sim-to-real policy transfer by learning from online correction. InConference on Robot Learning, pages 1691–1729. PMLR, 2025

  11. [11]

    Hg-dagger: Inter- active imitation learning with human experts

    Michael Kelly, Chelsea Sidrane, Katherine Driggs- Campbell, and Mykel J Kochenderfer. Hg-dagger: Inter- active imitation learning with human experts. In2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019

  12. [12]

    Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025

    Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Fout- ter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models.arXiv preprint arXiv:2506.17811, 2025

  13. [13]

    Risk-calibrated human-robot interaction via set-valued intent prediction

    Justin Lidard, Hang Pham, Ariel Bachman, Bryan Boateng, and Anirudha Majumdar. Risk-calibrated human-robot interaction via set-valued intent prediction. Robotics: Science and Systems, 2024

  14. [14]

    Multi-task interactive robot fleet learning with visual world models

    Huihan Liu, Yu Zhang, Vaarij Betala, Evan Zhang, James Liu, Crystal Ding, and Yuke Zhu. Multi-task interactive robot fleet learning with visual world models. In8th Annual Conference on Robot Learning, 2024

  15. [15]

    Calibrating llm-based evaluator

    Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. Calibrating llm-based evaluator. In Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024), pages 2638–2656, 2024

  16. [16]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demon- strations for robot manipulation. InarXiv preprint arXiv:2108.03298, 2021

  17. [17]

    World models that know when they don’t know: Controllable video generation with calibrated uncertainty.arXiv preprint arXiv:2512.05927, 2025

    Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, and Anirudha Majumdar. World models that know when they don’t know: Controllable video generation with calibrated uncertainty.arXiv preprint arXiv:2512.05927, 2025

  18. [18]

    Reasoning about uncertainty: Do reasoning models know when they don’t know?arXiv preprint arXiv:2506.18183, 2025

    Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, and Anirudha Majumdar. Reasoning about uncertainty: Do reasoning models know when they don’t know?arXiv preprint arXiv:2506.18183, 2025

  19. [19]

    Ensembledagger: A bayesian approach to safe imitation learning

    Kunal Menda, Katherine Driggs-Campbell, and Mykel J Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5041–5048. IEEE, 2019

  20. [20]

    Lbap: Improved uncertainty alignment of llm planners using bayesian inference

    James F Mullen and Dinesh Manocha. Lbap: Improved uncertainty alignment of llm planners using bayesian inference. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 18716– 18723. IEEE, 2025

  21. [21]

    Robots that ask for help: Uncertainty alignment for large language model planners.Conference on Robot Learning, 2023

    Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners.Conference on Robot Learning, 2023

  22. [22]

    Classification with valid and adaptive coverage.Ad- vances in neural information processing systems, 33: 3581–3591, 2020

    Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classification with valid and adaptive coverage.Ad- vances in neural information processing systems, 33: 3581–3591, 2020

  23. [23]

    A reduction of imitation learning and structured prediction to no-regret online learning

    St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelli- gence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  24. [24]

    Residual Policy Learning

    Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

  25. [25]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalk- wyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  26. [26]

    Evaluating uncertainty and quality of visual language action-enabled robots.arXiv preprint arXiv:2507.17049, 2025

    Pablo Valle, Chengjie Lu, Shaukat Ali, and Aitor Arrieta. Evaluating uncertainty and quality of visual language action-enabled robots.arXiv preprint arXiv:2507.17049, 2025

  27. [27]

    Springer, 2005

    Vladimir V ovk, Alexander Gammerman, and Glenn Shafer.Algorithmic learning in a random world. Springer, 2005

  28. [28]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  29. [29]

    Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification.arXiv preprint arXiv:2510.16281, 2025

    Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, An- drea Bajcsy, and Claudia P’erez-D’Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification.arXiv preprint arXiv:2510.16281, 2025

  30. [30]

    From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment.Robotics: Science and Systems, 2025

    Yilin Wu, Ran Tian, Gokul Swamy, and Andrea Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment.Robotics: Science and Systems, 2025

  31. Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual RL. arXiv preprint arXiv:2511.00091, 2025.

  32. Michelle D. Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, and Andrea Bajcsy. Conformalized interactive imitation learning: Handling expert shift and intermittent feedback. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=Ym2RNPX6la.

  33. Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In Forty-second International Conference on Machine Learning, 2025.

  34. Thomas P Zollo and Richard Zemel. Confidence calibration in vision-language-action models. arXiv preprint arXiv:2507.17383, 2025.

APPENDIX

A. Proof of Coverage Guarantees

To maintain formal coverage guarantees despite temporal dependencies, we follow the same proof style as in [21] and perform sequence-level calibration using the minimum score over phase...
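The sequence-level calibration described in the proof sketch (one minimum score per calibration sequence, so exchangeability is assumed across sequences rather than across temporally dependent timesteps) can be illustrated with a minimal split-conformal sketch. The function name and the use of raw verifier scores here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def calibrate_sequence_threshold(calib_scores, alpha=0.1):
    """Split-conformal threshold at sequence level.

    calib_scores: list of per-sequence arrays of verifier scores over phases.
    Each sequence contributes only its minimum score, giving one exchangeable
    calibration point per sequence.  Returns tau such that a fresh exchangeable
    sequence satisfies P(min-score >= tau) >= 1 - alpha.
    """
    seq_scores = np.sort([np.min(s) for s in calib_scores])
    n = len(seq_scores)
    # Finite-sample rank from the bottom: the test sequence falls below the
    # k-th smallest calibration score with probability at most k / (n + 1).
    k = int(np.floor((n + 1) * alpha))
    if k < 1:
        # too few calibration sequences to certify this alpha
        return -np.inf
    return float(seq_scores[k - 1])
```

With n = 10 calibration sequences and alpha = 0.2, k = floor(11 * 0.2) = 2, so tau is the second-smallest of the per-sequence minimum scores.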

B. Prompt Excerpts (Figs. 15–27)

The image 4 is the final state of the task execution. Determine task outcome from the Wrist View and Front View: - GRASPING THE CUP IS A SUCCESS if you only see part of the cup round edge in the wrist view in the final frame. - GRASPING THE CUP IS A FAILURE if you can clearly see a complete round shape of the cup edge in the wrist view in the final frame. Reference Examples (Ground Truth): - Image 1: GRASPIN... Tell the task outcome according to the rules; don't need to say the comparison between images.

Fig. 15: Hardware Prompt For Behavior Translation (Grasp)

Role: You are an expert Robotic Systems Monitor specializing in spatial reasoning, and failure detection. Task: Analyze the final outcome of a robot performing the task of picking up the cup and placing it in the bin. Visual Inputs: Each example contai... The image 3 is the final state of the task execution. Determine task outcome from the Wrist View: - SUCCESS if the cup is in either LEFT/RIGHT bin and standing upright. - FAILURE if the cup is on the border or it lies down in the bin. Reference Examples (Ground Truth): - Image 1: Successful because it places the cup in the LEFT bin and you can see the green cup is in the right part of the orange bin in the w... Tell the placement of the cup [LEFT/RIGHT]; tell the task outcome; don't need to say the comparison between images.

Fig. 16: Hardware Prompt For Behavior Translation (Place)

User Intent | Task Instruction
left | "Please place the nut on the peg with the handle facing left."
left | "I want the nut positioned on the peg with the handle directed left."
left | "Put the nut on the peg so that its handle is oriented leftward."
left | "Set...

If no mode fully meets the user's intent, you must choose "an option not listed here". Assume that each mode's outcome is exactly as described in the options, with no adjustments. Response format: Begin your response with the single letter corresponding to your choice. Repeat the description of that mode. Provide a one-sentence reason explaining why it is the most appropriate choice. User instruction: {USER_INSTRUCTION}

Fig. 17: Hardware Prompt for VLM Verification in Forewarn

Environment setup: You are a robot arm situated on one side of a tabletop and a user stands on the other side. All directions referenced should be from the user's perspective. Goal:... If the Intent Options has only A, output is: {"A": <score for option A>}. If the Intent Options has A and B, output is: {"A": <score for option A>, "B": <score for option B>}. If the Intent Options has A, B, C, output is: {"A": <score for option A>, "B": <score for option B>, "C": <score for option C>}. If the Intent Options has A, B, C, D, output is: {"A": <score for option A>, "B": <score for option B>, "C": <score for option C>, "D": <score for option D>}. Do not include any other text in your response, just the JSON object with the scores. <logit version> Format your response as follows: Provide only the single letter that corresponds to the selected...

Preserve the intent of the original instruction. Clearly specify which option was chosen based on the human's answer. Be concise and actionable. Sound natural (as if the user had given this instruction from the start). Generate ONLY the updated instruction, nothing else. Do not include any preamble or explanation.

Fig. 20: Hardware Prompt for Updating Instructions

You are a helpful robot assistant that needs to clarify ambiguous instructions before executing tasks. The user has given you the followi... Be polite and natural-sounding. Clearly present the available options. Help resolve the ambiguity so you can complete the task correctly. Generate ONLY the clarification question, nothing else. Do not include any preamble or explanation.

Fig. 21: Hardware Prompt for Clarification Questions

Role: You are an expert Robotic Systems Monitor specializing in spatial reasoning, and failure detection. Task: Based on the user's instruc... You can give any value between 0 and 1 and must give a probability value representing your own uncertainty rather than exactly 0 or 1!! Don't use 0 or 1! Evaluation Phase: [QUERY] Phase. [NOTE] Evaluation Rules (STRICT): Evaluate the options based on the evaluation phase and the user's intent provided to the robot. If it is grasping phase, just selects for appropriate grasping. If no mode fully meets the user's intent, you must choose "an option not listed here" over the failures. Assume that each mode's outcome is exactly as described in the options, with no adjustments. <self-score version> Format your response as follows: {"A": <score for option A>, "B": <score for option B>, "C": <score for option C>, "D": <score for option D>} Do not include any other text in your response, just the JSON object with the scores. Only include a...

Evaluate the options based on the evaluation phase, the reasoning and user instruction provided to the robot. If the evaluation phase is grasping, just selects the behavior mode for good grasping that can move to the next phase. Assign each possible behavior mode a likelihood score based on how probable it is given the instruction that the user provides to the robot. 1.0 means the given behavior mode is the only feasible option given the user's instruction, and 0.0 means the given behavior mode is completely at odds with the user's instruction. Ensure that the scores sum to 1.0 across all modes and this is very important!!! If multiple options seem equally likely, please distribute the likelihood scores evenly among those options. You can give any value between 0 and 1 and must give a probability value representing your own uncertainty rather than exactly 0 or 1!! Don't use 0 or 1! Clarifications: If the user's instruction is underspecified, please do not make any biased assumptions about the user's intent or preferences. <self-score version> Format your response for step 1 as foll...

Evaluate the options based on the evaluation phase and the user's instruction. If it is grasping phase, just selects for appropriate grasping. If no mode fully satisfies the user's instruction, you must choose "an option not listed here" rather than other options. Assume that each mode's outcome is exactly as described in the options, with no adjustments. <self-score version> Response format: {"A": <score for option A>, "B": <score for option B>, "C": <score for option C>, "D": <score for option D>} Do not include any other text in your response, just the JSON object with the scores. <logit version> Format your res...

The Image 4 provided is the final state of the task execution. Determine task outcome from the Wrist View and Front View: - GRASPING THE NUT IS A SUCCESS if the gripper grasps the nut and the handle of the nut is between the grippers in the wrist view and the hole of the nut is centered in the wrist view. - GRASPING THE NUT IS A FAILURE if any of the following is true: 1. the gripper doesn't grasp the nut, 2. the nut... Describe the task outcome according to the rules. Don't need to explicitly say the comparison between images.

Fig. 26: Simulation Prompt For Behavior Translation (Grasp)

Role: You are an expert Robotic Systems Monitor specializing in spatial reasoning, assembly verification, and failure detection. Task: Analyze the final outcome of a robot performing a square-nut peg-in-hole assembly from the last frame o... Image 4 provided is the final state of the task execution. Determine task outcome from the Wrist View and Front View: - ALIGNING THE NUT IS A SUCCESS HANDLE-FACING-LEFT if both of the following are true: 1. the square hole of the nut is aligned with the square peg (i.e., the peg lies roughly within the inner borders of the nut's hole) in the wrist view and 2. the end effector of the robot is to the left of the p... Describe the task outcome and direction of the nut (SUCCESS or FAILURE and, if SUCCESS, HANDLE-FACING-RIGHT or HANDLE-FACING-LEFT) according to the rules. Don't need to explicitly say the comparison between images.

Fig. 27: Simulation Prompt For Behavior Translation (Place)

Task setup: You are an expert Robotic Systems Monitor specializing in spatial reasoning and failure detection. You are guiding a robot to complete a nut assembly task. The robot is situated on one side of a table and the user stands on the...
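The prompt formats above have the verifier emit a JSON object of per-option scores that should sum to 1, with one option reserved for "an option not listed here". A minimal sketch of how such a reply could be parsed, renormalized, and mapped to the execute/query/intervene decision against a calibrated threshold; the function names, the specific decision rule, and the choice of "D" as the unlisted option are illustrative assumptions, not the paper's code.

```python
import json

def parse_option_scores(raw: str) -> dict:
    """Parse the verifier's JSON reply, e.g. '{"A": 0.7, "B": 0.3}', and
    renormalize so the scores sum to 1 as the prompt requests."""
    scores = {k: float(v) for k, v in json.loads(raw).items()}
    total = sum(scores.values())
    if total <= 0:
        # degenerate reply: fall back to a uniform distribution
        return {k: 1.0 / len(scores) for k in scores}
    return {k: v / total for k, v in scores.items()}

def select_strategy(scores: dict, tau: float, unlisted_key: str = "D") -> str:
    """Map scores to a steering decision (illustrative rule):
    - intervene  if no option clears tau, or only the unlisted option does
    - execute    if exactly one option clears the calibrated threshold tau
    - query      if several options clear it (ambiguous intent)
    """
    confident = {k for k, v in scores.items() if v >= tau}
    if not confident or confident == {unlisted_key}:
        return "intervene"
    if len(confident) == 1:
        return "execute"
    return "query"
```

For example, `select_strategy(parse_option_scores('{"A": 0.8, "B": 0.1, "C": 0.05, "D": 0.05}'), tau=0.5)` yields "execute", while two near-tied options above the threshold yield "query".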
