arxiv: 2606.18043 · v1 · pith:HTPCOTZCnew · submitted 2026-06-16 · 💻 cs.RO · cs.LG

Uncertainty Quantification for Flow-Based Vision-Language-Action Models

Pith reviewed 2026-06-27 00:41 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords vision-language-action models · flow matching · epistemic uncertainty · failure detection · active fine-tuning · robotic manipulation · LIBERO benchmark

The pith

A small ensemble of flow models quantifies epistemic uncertainty in vision-language-action systems by measuring velocity-field disagreement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models trained with flow matching lack built-in ways to report when their generated actions are likely to fail. The paper derives velocity-field disagreement measured across a small ensemble as an efficient estimator of epistemic uncertainty for these models. The estimate supports failure detection at deployment time and drives an active fine-tuning procedure that selects which new demonstrations to collect. A reader should care because robots in changing environments need to recognize their own uncertainty and adapt with limited additional expert data.

Core claim

The authors derive velocity-field disagreement across a small ensemble of flow-matching models as an efficient estimator of epistemic uncertainty in the action head. On the LIBERO benchmark this estimator produces better-calibrated scores that predict downstream performance, detects failures effectively, and enables the SAVE active fine-tuning framework that requires at least 22 percent fewer expert demonstrations than baseline acquisition strategies.

What carries the argument

Velocity-field disagreement (VFD) across a small ensemble of flow models, which isolates epistemic uncertainty by measuring differences in predicted action velocities.
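
As a rough illustration of the idea, and not the paper's exact estimator (the Figure 1 caption mentions a scaling of the differences that is not reproduced here), ensemble disagreement can be sketched as the variance of predicted velocities across members, averaged over sampled flow times and sampled points in action space. The function name `velocity_disagreement`, the member interface `v(obs, x_t, t)`, the Gaussian sampling of `x_t`, and the toy velocity fields are all assumptions made for this sketch.

```python
import random

def velocity_disagreement(members, obs, action_dim, n_samples=32, seed=0):
    """Mean across-member variance of predicted velocities.

    members: list of callables v(obs, x_t, t) -> list[float] (hypothetical interface).
    Returns a non-negative scalar; 0.0 means the members agree at every sampled point.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()  # flow time in [0, 1)
        # Stand-in for a point on the flow path (assumption: standard Gaussian).
        x_t = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
        preds = [v(obs, x_t, t) for v in members]
        for d in range(action_dim):
            vals = [p[d] for p in preds]
            mean = sum(vals) / len(vals)
            total += sum((x - mean) ** 2 for x in vals) / len(vals)
    return total / (n_samples * action_dim)

# Toy two-member ensemble: members agree at obs = 0 and diverge as obs grows.
def make_member(slope):
    return lambda obs, x_t, t: [slope * obs - x for x in x_t]

ensemble = [make_member(1.0), make_member(1.2)]
in_dist = velocity_disagreement(ensemble, obs=0.0, action_dim=3)
out_dist = velocity_disagreement(ensemble, obs=5.0, action_dim=3)
```

In this toy setup the score is zero where the two members coincide and grows where they diverge, which is the qualitative behaviour the argument relies on.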

If this is right

  • Models gain the ability to flag unreliable actions during deployment in non-stationary environments.
  • Uncertainty-guided acquisition reduces the number of expert demonstrations needed to adapt to new tasks.
  • VFD scores are predictive of downstream task performance.
  • The method applies to existing flow-based VLA architectures without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Disagreement measures of this type could extend to diffusion-based or other generative action heads.
  • In multi-agent robotic teams the same signal might coordinate when one agent should defer to another.
  • Further reduction of ensemble size could enable onboard uncertainty estimation on resource-limited platforms.

Load-bearing premise

Disagreement among velocity fields from a small ensemble reliably isolates epistemic uncertainty rather than aleatoric noise or model-specific artifacts.

What would settle it

An experiment in which VFD scores show no correlation with actual failure rates on held-out tasks, or in which uncertainty-guided sample selection requires as many or more demonstrations as random selection.
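
One way to run the first check is a rank correlation between per-task uncertainty and per-task failure rate. The sketch below computes Spearman's coefficient with average ranks for ties. The numbers are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Placeholder per-task values (illustrative only).
uncertainty = [0.10, 0.35, 0.20, 0.80, 0.55]
failure_rate = [0.05, 0.30, 0.25, 0.90, 0.40]
rho = spearman(uncertainty, failure_rate)
```

A coefficient near zero on held-out tasks would be the falsifying outcome described above.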

Figures

Figures reproduced from arXiv: 2606.18043 by Andreas Krause, Angela P. Schoellig, Ben Sturgis, Daniel Marta, Marco Bagatella, Maximilian Seeliger, Ralf Römer, Saida Liu.

Figure 1: Top: VFD quantifies epistemic uncertainty by measuring scaled differences between ensembled velocity fields. Bottom: SAVE prioritizes tasks by their mean VFD uncertainty and, for the most uncertain initial observations within each sampled task, requests an expert demonstration. The models are then fine-tuned using new and replay data, yielding data-efficient multitask adaptation. VLAs are pre-trained on la…
Figure 2: The VLA ensemble size has little impact on calibration, allowing for a lightweight two-member ensemble.
[Plot: x-axis Prompt (P1 to P5); y-axis Success Rate / Uncertainty; legend Action-L2, ACE, DECU, GU, Entropy, Perplexity, VFD (ours), Success Rate]
Figure 4: Effect of the task-sampling temperature τ on SAVE. Larger τ biases expert queries toward higher-uncertainty tasks, with uniform sampling for τ = 0. Left: Uncertainty-guidance is beneficial both for sampling tasks and initial observations. Legend entries containing uniform correspond to sampling initial observations within a sampled task uniformly instead of uncertainty-guided. Middle: Uncertainty-based sa…
Figure 5: VFD epistemic uncertainty can detect failures during deployment. We consider three recent baselines for detecting failures of generative policies: ACE [58] computes the conditional entropy of the action distribution, STAC [2] compares action distributions at consecutive timesteps, and RND-OE [58] detects OOD observations. As shown in …
Figure 6: Visualization of the velocity-field disagreement (VFD) computation for two conditioning …
Figure 7: Epistemic uncertainty estimation for a 2D generative modeling problem. Our velocity field …
Figure 8: Learning curves corresponding to the active fine-tuning results reported in Table …
Figure 9: Ensembling vs. Laplace Approximation. Calibration is measured by the negative Spearman …
Figure 10: Pareto front of the exploration-exploitation trade-off.
Figure 11: Behavior of our VFD-acquisition rule for different temperature values …
Figure 12: Full results of our failure detection experiments. VFD achieves the highest timestep-wise …
Original abstract

Vision-language-action models (VLAs) combine vision-language backbones with expressive generative action heads trained via flow matching on large-scale robotic datasets. Despite their strong empirical performance in robotic manipulation, VLAs lack mechanisms to quantify confidence in their predictions and to detect when their actions may be unreliable. This presents a critical limitation for real-world deployment in non-stationary environments, where models inevitably encounter scenarios outside their pretraining distribution and may fail without warning. To address this, we derive an efficient method for quantifying epistemic uncertainty in flow-matching models by leveraging velocity-field disagreement (VFD) across a small ensemble. We successfully use this uncertainty estimate for failure detection during deployment and active fine-tuning of flow-based VLAs. To this end, we propose SAVE, a framework for uncertainty-guided active multitask fine-tuning that reduces the number of costly expert demonstrations required to adapt VLAs to new tasks. Through extensive experiments on the LIBERO benchmark, we demonstrate that VFD yields better-calibrated uncertainty estimates predictive of downstream performance, that VFD achieves strong performance in detecting failures, and that uncertainty-guided data acquisition with SAVE requires at least 22% fewer samples than baselines. In summary, our work shows that quantifying epistemic uncertainty in flow-based VLAs improves both failure awareness and adaptation. Project website: tum-lsy.github.io/uq_vla/.
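
The acquisition step of SAVE, as described in the Figure 1 and Figure 4 captions (sample tasks according to their mean uncertainty, then query the most uncertain initial observations within each sampled task), can be sketched as follows. The exponential weighting `exp(tau * u)` is an assumption chosen only because it reproduces the stated behaviour (uniform at τ = 0, sharper preference for uncertain tasks as τ grows); the paper's actual rule may differ. Task names, the data layout, and all function names are hypothetical.

```python
import math
import random

def task_probabilities(mean_uncertainty, tau):
    """Softmax over per-task mean uncertainty scaled by temperature tau.

    tau = 0 gives uniform sampling; larger tau favours uncertain tasks.
    The exponential form is an assumption of this sketch.
    """
    peak = max(mean_uncertainty.values())  # subtract the max for numerical stability
    weights = {k: math.exp(tau * (u - peak)) for k, u in mean_uncertainty.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def select_queries(obs_uncertainty, tau, n_tasks, per_task, seed=0):
    """Sample tasks, then pick the most uncertain initial observations in each.

    obs_uncertainty: {task: {observation_id: uncertainty}} (hypothetical layout).
    Tasks are drawn with replacement, so a task can appear more than once.
    Returns a list of (task, observation_id) pairs to send to the expert.
    """
    rng = random.Random(seed)
    means = {k: sum(v.values()) / len(v) for k, v in obs_uncertainty.items()}
    probs = task_probabilities(means, tau)
    tasks = rng.choices(list(probs), weights=list(probs.values()), k=n_tasks)
    queries = []
    for task in tasks:
        ranked = sorted(obs_uncertainty[task], key=obs_uncertainty[task].get, reverse=True)
        queries.extend((task, obs_id) for obs_id in ranked[:per_task])
    return queries

# Placeholder tasks and scores (illustrative only).
scores = {
    "open_drawer": {"s0": 0.1, "s1": 0.2},
    "stack_bowls": {"s0": 0.9, "s1": 0.7},
}
uniform = task_probabilities({"open_drawer": 0.15, "stack_bowls": 0.8}, tau=0.0)
peaked = task_probabilities({"open_drawer": 0.15, "stack_bowls": 0.8}, tau=10.0)
queries = select_queries(scores, tau=10.0, n_tasks=1, per_task=1)
```

The fine-tuning step on new and replay data is left out; only the selection logic is shown.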

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to derive an efficient epistemic uncertainty quantification method for flow-matching VLAs via velocity-field disagreement (VFD) across a small ensemble of velocity fields. It introduces the SAVE framework for uncertainty-guided active multitask fine-tuning and reports on LIBERO that VFD yields better-calibrated estimates predictive of performance, achieves strong failure detection, and enables SAVE to require at least 22% fewer expert demonstrations than baselines.

Significance. If VFD reliably isolates epistemic uncertainty from aleatoric or optimization artifacts in flow-matching models, the approach would meaningfully advance safe deployment and efficient adaptation of VLAs. The reported sample-efficiency gain and failure-detection results on LIBERO would be practically relevant for robotics, provided the core separation holds under controlled tests.

major comments (3)
  1. [Abstract, §3] The central claim that VFD isolates epistemic uncertainty (rather than optimization stochasticity or aleatoric action noise) is load-bearing for both the failure-detection AUC and the 22% SAVE gain, yet no experiment is described that holds aleatoric variance fixed while injecting controlled OOD support gaps (e.g., via synthetic action noise or distribution shift with matched variance).
  2. [§4.2, Table 2] The reported 22% reduction in required samples for SAVE is presented without statistical significance tests, variance across random seeds, or explicit confirmation that baseline implementations match the original papers; this directly affects whether the cross-method comparison supports the efficiency claim.
  3. [§3.1, Eq. (3)–(5)] The VFD quantity is defined as disagreement among velocity fields v_θ_i; because each v_θ is a deterministic function of its parameters, the derivation does not include a term or proof separating parameter uncertainty from sensitivity to the stochasticity of the flow-matching training objective or to inherent action multimodality.
minor comments (2)
  1. [Abstract] The abstract states 'at least 22% fewer samples' without specifying the exact baseline methods or the precise metric (e.g., demonstrations until a performance threshold); this should be clarified for reproducibility.
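  2. [§4] Ensemble size is listed as a free parameter in the axiom ledger. The Figure 2 caption reports that it has little impact on calibration and motivates a two-member ensemble, but the experimental section should state this default and the range of sizes tested explicitly.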
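
For reference, the failure-detection AUC named in the first major comment can be computed from per-rollout uncertainty scores and outcomes as the probability that a failed rollout receives a higher score than a successful one. This is a generic sketch of that statistic, not the paper's evaluation code; the scores and outcomes are placeholders.

```python
def auroc(scores, failed):
    """Probability that a failed rollout scores higher than a successful one.

    Ties count as one half. scores: list[float]; failed: list[bool].
    """
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failed and one successful rollout")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Placeholder rollout scores and outcomes (illustrative only).
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
failed = [False, False, True, True, True, False]
auc = auroc(scores, failed)
```

A value of 0.5 corresponds to a detector that carries no information about failures.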

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of validating the epistemic nature of VFD and strengthening the empirical claims. We respond to each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract, §3] The central claim that VFD isolates epistemic uncertainty (rather than optimization stochasticity or aleatoric action noise) is load-bearing for both the failure-detection AUC and the 22% SAVE gain, yet no experiment is described that holds aleatoric variance fixed while injecting controlled OOD support gaps (e.g., via synthetic action noise or distribution shift with matched variance).

    Authors: We agree that the current LIBERO task-shift experiments, while showing VFD's predictive power for performance drops, do not include the precise controlled isolation suggested. In the revision we will add a new experiment subsection that fixes aleatoric variance (via matched synthetic action noise) while systematically varying support gaps, to more directly test whether VFD isolates epistemic uncertainty. revision: yes

  2. Referee: [§4.2, Table 2] The reported 22% reduction in required samples for SAVE is presented without statistical significance tests, variance across random seeds, or explicit confirmation that baseline implementations match the original papers; this directly affects whether the cross-method comparison supports the efficiency claim.

    Authors: We acknowledge that the efficiency claim would be more robust with these details. We will revise §4.2 and Table 2 to report means and standard deviations over at least five random seeds, include statistical significance tests (e.g., paired t-tests), and add an appendix section confirming that all baselines were reimplemented following the original papers' protocols and hyperparameters. revision: yes

  3. Referee: [§3.1, Eq. (3)–(5)] The VFD quantity is defined as disagreement among velocity fields v_θ_i; because each v_θ is a deterministic function of its parameters, the derivation does not include a term or proof separating parameter uncertainty from sensitivity to the stochasticity of the flow-matching training objective or to inherent action multimodality.

    Authors: VFD follows the standard ensemble approach where disagreement across differently initialized models approximates epistemic uncertainty arising from parameter variability. The shared training procedure across ensemble members is intended to keep optimization stochasticity comparable. We will expand the discussion around Eq. (3)–(5) to explicitly state these assumptions and limitations regarding training stochasticity and action multimodality, without claiming a formal separation proof. revision: partial
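
The seed-level reporting promised in the second response could look roughly like this: per-seed demonstration counts for two acquisition strategies, their means and standard deviations, and a paired t statistic. The counts are invented placeholders, not numbers from the paper, and converting the statistic into a p-value (Student t distribution with n − 1 degrees of freedom) is left out.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i]; degrees of freedom = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Placeholder values: demonstrations needed to reach a fixed success rate, per seed.
baseline = [100, 104, 98, 102, 96]
guided = [80, 83, 79, 82, 76]

summary = {
    "baseline": (statistics.mean(baseline), statistics.stdev(baseline)),
    "guided": (statistics.mean(guided), statistics.stdev(guided)),
}
relative_saving = 1 - summary["guided"][0] / summary["baseline"][0]
t_stat = paired_t_statistic(baseline, guided)
```

Pairing by seed assumes both strategies were run with the same seeds and initial model, so that each difference isolates the effect of the acquisition rule.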

Circularity Check

0 steps flagged

No circularity: VFD is direct ensemble disagreement, not a fitted or self-defined quantity

full rationale

The paper derives epistemic uncertainty via velocity-field disagreement (VFD) computed directly from an ensemble of flow-matching velocity fields v_θ. This is a standard, non-circular construction: disagreement is measured on the outputs of independently trained models without any parameter being fitted to the target uncertainty or downstream metric. SAVE then uses the resulting scalar as a selection criterion for active fine-tuning; the selection rule does not reduce to the input demonstrations by definition. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling is present in the abstract or described derivation chain. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that small-ensemble velocity disagreement approximates epistemic uncertainty in flow models, plus standard supervised-learning assumptions about the LIBERO tasks representing deployment distribution shifts. No new physical entities are introduced.

free parameters (2)
  • ensemble size
    Number of flow models whose velocity fields are compared; chosen to balance compute and uncertainty signal quality.
  • uncertainty threshold for failure detection / data selection
    Decision threshold on VFD score; fitted or tuned on validation data.
axioms (2)
  • domain assumption: Ensemble disagreement in velocity fields captures epistemic rather than aleatoric uncertainty for flow-matching policies.
    Invoked when claiming VFD is a valid epistemic uncertainty estimate.
  • domain assumption: LIBERO task splits simulate realistic non-stationary deployment shifts.
    Required for the failure-detection and active-fine-tuning claims to transfer beyond the benchmark.
invented entities (2)
  • VFD (velocity-field disagreement) · no independent evidence
    purpose: Epistemic uncertainty proxy for flow-based VLAs
    Newly defined quantity; no independent evidence outside the paper's experiments.
  • SAVE framework · no independent evidence
    purpose: Uncertainty-guided active multitask fine-tuning loop
    Newly proposed procedure; no independent evidence outside the paper.
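
The uncertainty threshold listed among the free parameters is described only as fitted or tuned on validation data. One plausible way to do that, shown here purely as an assumption, is to take an empirical quantile of the scores observed on successful validation rollouts, so that the threshold corresponds to a target false-alarm rate. Function names and scores are hypothetical.

```python
import math

def quantile_threshold(success_scores, false_alarm_rate=0.05):
    """Empirical (1 - rate) quantile of scores from successful validation rollouts."""
    ordered = sorted(success_scores)
    # Index of the (1 - rate) empirical quantile, clamped to the valid range.
    idx = min(len(ordered) - 1, math.ceil((1 - false_alarm_rate) * len(ordered)) - 1)
    return ordered[max(idx, 0)]

def flag_failure(score, threshold):
    """Raise an alarm when the uncertainty score exceeds the calibrated threshold."""
    return score > threshold

# Placeholder validation scores from successful rollouts (illustrative only).
validation_scores = [0.02 * i for i in range(1, 21)]
threshold = quantile_threshold(validation_scores, false_alarm_rate=0.10)
alarms = [flag_failure(s, threshold) for s in (0.10, 0.39, 0.75)]
```

Calibrating on successful rollouts only means no failure data is needed to set the threshold, at the cost of not controlling the miss rate directly.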



Reference graph

Works this paper leans on

67 extracted references · 6 linked inside Pith

  1. [1]

    M. Abdar et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 2021.

  2. [2]

    C. Agia, R. Sinha, J. Yang, Z. Cao, R. Antonova, M. Pavone, and J. Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. In Conference on Robot Learning (CoRL), 2025.

  3. [3]

    AgiBot-World et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In International Conference on Intelligent Robots and Systems (IROS), 2025.

  4. [4]

    A. N. Angelopoulos and S. Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023.

  5. [5]

    J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations (ICLR), 2020.

  6. [6]

    M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan. Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2245–2264, 2025.

  7. [7]

    M. S. Ayhan and P. Berens. Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks. In Medical Imaging with Deep Learning, 2018.

  8. [8]

    M. Bagatella, J. Hübotter, G. Martius, and A. Krause. Active fine-tuning of multi-task policies. In International Conference on Machine Learning (ICML), 2025.

  9. [9]

    L. Berry, A. Brando, and D. Meger. Shedding light on large generative networks: Estimating epistemic uncertainty in diffusion models. In Conference on Uncertainty in Artificial Intelligence (UAI), 2024.

  10. [10]

    K. Black et al. π0.5: a vision-language-action model with open-world generalization. In Conference on Robot Learning (CoRL), 2025.

  11. [11]

    R. Cadene, S. Alibert, F. Capuano, M. Aractingi, A. Zouitine, P. Kooijmans, J. Choghari, M. Russi, C. Pascal, S. Palma, et al. LeRobot: An open-source library for end-to-end robot learning. In International Conference on Learning Representations (ICLR), 2026.

  12. [12]

    K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 10:273–304, 1995.

  13. [13]

    M. Chan, M. Molina, and C. Metzler. Estimating epistemic and aleatoric uncertainty with a single model. Advances in Neural Information Processing Systems (NeurIPS), 2024.

  14. [14]

    C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. Robotics: Science and Systems (RSS), 2023.

  15. [15]

    Y. Cui, D. Isele, S. Niekum, and K. Fujimura. Uncertainty-aware data aggregation for deep imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2019.

  16. [16]

    E. Daxberger, A. Kristiadi, A. Immer, R. Eschenhagen, M. Bauer, and P. Hennig. Laplace redux - effortless Bayesian deep learning. Advances in Neural Information Processing Systems (NeurIPS), 2021.

  17. [17]

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.

  18. [18]

    J. Diquigiovanni, M. Fontana, S. Vantini, et al. The importance of being a band: Finite-sample exact distribution-free prediction sets for functional data. Statistica Sinica, 1:1–41, 2024.

  19. [19]

    S. Dohare, J. F. Hernandez-Garcia, Q. Lan, P. Rahman, A. R. Mahmood, and R. S. Sutton. Loss of plasticity in deep continual learning. Nature, 632:768–774, 2024.

  20. [20]

    A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

  21. [21]

    B. Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106:1602–1614, 2011.

  22. [22]

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024.

  23. [23]

    E. Fadeeva et al. Fact-checking the output of large language models via token-level uncertainty quantification. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9367–9385, 2024.

  24. [24]

    G. Franchi, N. Belkhir, D. N. Trong, G. Xia, and A. Pilzer. Towards understanding and quantifying uncertainty for text-to-image generation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  25. [25]

    Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.

  26. [26]

    Y. Gal, R. Islam, and Z. Ghahramani. Deep Bayesian active learning with image data. In International Conference on Machine Learning (ICML), 2017.

  27. [27]

    Q. Gu, Y. Ju, S. Sun, I. Gilitschenski, H. Nishimura, M. Itkina, and F. Shkurti. SAFE: Multi-task failure detection for vision-language-action models. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  28. [28]

    Z. He, Y. Cao, and M. Ciocarlie. Uncertainty comes for free: Human-in-the-loop policies with diffusion models. arXiv preprint arXiv:2503.01876, 2025.

  29. [29]

    J. Hejna, S. Mirchandani, A. Balakrishna, A. Xie, A. Wahid, J. Tompson, P. Sanketi, D. Shah, C. Devin, and D. Sadigh. Robot data curation with mutual information estimators. Robotics: Science and Systems (RSS), 2025.

  30. [30]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020.

  31. [31]

    D. Holzmüller, V. Zaverkin, J. Kästner, and I. Steinwart. A framework and benchmark for deep batch active learning for regression. Journal of Machine Learning Research (JMLR), 24(164):1–81, 2023.

  32. [32]

    E. Hüllermeier and W. Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110:457–506, 2021.

  33. [33]

    M. Jazbec, E. Wong-Toi, G. Xia, D. Zhang, E. Nalisnick, and S. Mandt. Generative uncertainty in diffusion models. In Conference on Uncertainty in Artificial Intelligence (UAI), 2025.

  34. [34]

    L. Ju, M. Nautiyal, A. Hellander, E. Vats, and P. Singh. Epistemic uncertainty quantification for pre-trained VLMs via Riemannian flow matching. arXiv preprint arXiv:2601.21662, 2026.

  35. [35]

    K. Judah, A. Fern, and T. G. Dietterich. Active imitation learning via reduction to IID active learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.

  36. [36]

    R. Karczewski, M. Heinonen, and V. Garg. Diffusion models as cartoonists: The curious case of high density regions. In International Conference on Learning Representations (ICLR), 2025.

  37. [37]

    U. B. Karli, T. Kurumisawa, and T. Fitzgerald. Ask before you act: Token-level uncertainty for intervention in vision-language-action models. In Second Workshop on Out-of-Distribution Generalization in Robotics at RSS, 2025.

  38. [38]

    A. Khazatsky et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems, 2024.

  39. [39]

    M. J. Kim et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2024.

  40. [40]

    A. Kirsch, J. van Amersfoort, and Y. Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

  41. [41]

    B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems (NeurIPS), 2017.

  42. [42]

    S.-W. Lee, X. Kang, and Y.-L. Kuo. Diff-DAgger: Uncertainty estimation with diffusion policy for robotic manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2025.

  43. [43]

    Q. Li, B. Yin, W. Huang, R. Liu, B. Zou, R. Yu, J. Ye, W. Yu, and X. Wang. Vision-language-action safety: Threats, challenges, evaluations, and mechanisms. arXiv preprint arXiv:2604.23775, 2026.

  44. [44]

    C. Ling et al. Uncertainty quantification for in-context learning of large language models. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2024.

  45. [45]

    Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

  46. [46]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems (NeurIPS), 2023.

  47. [47]

    A. Loquercio, M. Segu, and D. Scaramuzza. A general framework for uncertainty estimation in deep learning. IEEE Robotics and Automation Letters, 2020.

  48. [48]

    H. Ma, J. Chen, J. T. Zhou, G. Wang, and C. Zhang. Estimating LLM uncertainty with evidence. arXiv preprint arXiv:2502.00290, 2025.

  49. [49]

    A. Malinin and M. Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations (ICLR), 2021.

  50. [50]

    Z. Mei, T. Yin, M. Baker, O. Shorinwa, and A. Majumdar. World models that know when they don’t know: Controllable video generation with calibrated uncertainty. arXiv preprint arXiv:2512.05927, 2025.

  51. [51]

    E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don’t know? In International Conference on Learning Representations (ICLR), 2019.

  52. [52]

    NVIDIA et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.

  53. [53]

    A. O’Neill et al. Open X-embodiment: Robotic learning datasets and RT-X models: Open X-embodiment collaboration. In IEEE International Conference on Robotics and Automation (ICRA), 2024.

  54. [54]

    Physical Intelligence et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  55. [55]

    A. Z. Ren et al. Robots that ask for help: Uncertainty alignment for large language model planners. In Conference on Robot Learning (CoRL), 2023.

  56. [56]

    J. Ren, J. Luo, Y. Zhao, K. Krishna, M. Saleh, B. Lakshminarayanan, and P. J. Liu. Out-of-distribution detection and selective generation for conditional language models. In International Conference on Learning Representations (ICLR), 2023.

  57. [57]

    M. Reuss, H. Zhou, M. Rühle, Ö. E. Yağmurlu, F. Otto, and R. Lioutikov. FLOWER: Democratizing generalist robot policies with efficient vision-language-flow policies. In Conference on Robot Learning (CoRL), 2025.

  58. [58]

    R. Römer, A. Kobras, L. Worbis, and A. P. Schoellig. Failure prediction at runtime for generative robot policies. Advances in Neural Information Processing Systems (NeurIPS), 2025.

  59. [59]

    R. Römer, J. Balletshofer, J. Thumm, M. Pavone, A. P. Schoellig, and M. Althoff. From demonstrations to safe deployment: Path-consistent safety filtering for diffusion policies. In IEEE International Conference on Robotics and Automation (ICRA), 2026.

  60. [60]

    R. Römer, Y. Zhang, and A. P. Schoellig. CLARE: Continual learning for vision-language-action models via autonomous adapter routing and expansion. arXiv preprint arXiv:2601.09512, 2026.

  61. [61]

    O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations (ICLR), 2018.

  62. [62]

    B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2009.

  63. [63]

    O. Shorinwa, Z. Mei, J. Lidard, A. Z. Ren, and A. Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. ACM Computing Surveys, 58:1–38, 2025.

  64. [64]

    M. Shukor et al. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.

  65. [65]

    C. Xu, T. Khuong Nguyen, E. Dixon, C. Rodriguez, P. Miller, R. Lee, P. Shah, R. Ambrus, H. Nishimura, and M. Itkina. Can we detect failures without failure data? Uncertainty-aware runtime failure detection for imitation learning policies. In Robotics: Science and Systems (RSS), 2025.

  66. [66]

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023.

  67. [67]

    B. Zitkovich et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023.