arxiv: 2505.09591 · v3 · submitted 2025-05-14 · 💻 cs.CV · cs.AI

Variational Visual Question Answering for Uncertainty-Aware Selective Prediction

Pith reviewed 2026-05-22 15:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords variational inference · visual question answering · selective prediction · uncertainty estimation · vision-language models · model calibration · Bayesian methods

The pith

Variational Bayes enables effective selective prediction in Visual Question Answering by using posterior samples and variance-aware selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that variational Bayes can be applied successfully to make large vision-language models answer visual questions only when sufficiently confident. This matters because it reduces overconfidence and hallucinations on VQA and visual reasoning tasks, especially when the allowed error rate is very low. The proposed Variational VQA approach improves calibration and delivers gains over standard training, with even a single posterior sample often proving more reliable than models optimized with AdamW. A new risk-averse selector that uses prediction variance further outperforms simple averaging of samples.

Core claim

We show for the first time the effectiveness and competitive edge of variational Bayes for selective prediction in VQA. We build on recent advances in variational methods for deep learning and propose an extension called Variational VQA. This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low (≤1%). Often, just one posterior sample yields more reliable answers than those given by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions.
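Calibration in this setting is reported as the Expected Calibration Error (ECE). As a minimal sketch of what that number measures, the block below uses the common equal-width binning form of ECE: it groups predictions by confidence and averages the gap between confidence and accuracy per bin. The function name, the bin count, and the toy inputs are assumptions for illustration; the paper's evaluation code may differ in details.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - avg accuracy| per bin."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin instead of overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        avg_acc = sum(hit for _, hit in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - avg_acc)
    return ece


# Toy inputs (made up): four predictions with their confidence and correctness.
toy_confidences = [0.95, 0.95, 0.55, 0.55]
toy_correct = [1, 1, 1, 0]
print(expected_calibration_error(toy_confidences, toy_correct))
```

A lower value means stated confidence tracks observed accuracy more closely, which is the property a confidence threshold relies on when deciding whether to answer.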

What carries the argument

Variational VQA, an extension of variational inference to vision-language models that supports uncertainty-aware selective prediction via posterior sampling combined with a variance-based risk-averse selector.
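The difference between sample averaging and a variance-aware selector can be sketched as follows. The paper's exact selector is not reproduced on this page, so the `mean - penalty * std` form, the `penalty` weight, the threshold, and every function name here are assumptions chosen for illustration, not the authors' definition.

```python
from statistics import mean, pstdev


def mean_confidence(sample_probs):
    """Sample averaging: mean answer probability across posterior samples."""
    return mean(sample_probs)


def risk_averse_confidence(sample_probs, penalty=1.0):
    """Hypothetical variance-aware score: penalize disagreement between samples."""
    return mean(sample_probs) - penalty * pstdev(sample_probs)


def should_answer(confidence, threshold):
    """Answer only if the confidence score clears the abstention threshold."""
    return confidence >= threshold


# Toy per-sample probabilities (made up) for the predicted answer to two questions.
# Both have a mean of about 0.80, but the samples disagree far more on the second.
stable = [0.80, 0.81, 0.79, 0.80]
unstable = [0.99, 0.61, 0.95, 0.65]

for name, probs in [("stable", stable), ("unstable", unstable)]:
    by_mean = should_answer(mean_confidence(probs), threshold=0.75)
    by_risk = should_answer(risk_averse_confidence(probs), threshold=0.75)
    print(name, "mean selector answers:", by_mean, "| risk-averse answers:", by_risk)
```

In this toy case the mean selector answers both questions, while the risk-averse score abstains on the one where posterior samples disagree. That is the kind of behavior a variance term is meant to add on top of averaging.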

If this is right

  • Selective prediction performance improves markedly at error tolerances of 1% or below.
  • A single draw from the variational posterior often suffices for more reliable answers than deterministic training.
  • Accounting for prediction variance in the selector yields better abstention decisions than averaging.
  • Variational learning offers a practical route to safer, more trustworthy large VLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variational treatment could be tested on related multimodal tasks beyond VQA and visual reasoning, such as caption generation.
  • If the approach scales, it may become a standard way to add built-in uncertainty awareness to deployed vision-language systems.
  • Combining the variance selector with other calibration techniques could produce further reliability gains in safety-critical settings.

Load-bearing premise

Variational inference stays computationally tractable and effective on large vision-language models, and the variance-based selector adds gains beyond what posterior averaging already provides.

What would settle it

If experiments on standard VQA benchmarks show no meaningful improvement in selective prediction accuracy at 1% error tolerance compared to non-variational baselines, the central claim would not hold.
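One way to make this test concrete is to measure coverage at a fixed risk: the largest fraction of questions a model can answer while keeping the average error among answered questions at or below the tolerance. The sketch below follows that common definition; the function name, the use of per-question accuracies in [0, 1], and the toy inputs are assumptions, and the paper's own metric may be defined with additional details.

```python
def coverage_at_risk(confidences, accuracies, max_risk):
    """Largest answered fraction whose average error stays at or below max_risk.

    Questions are answered in order of decreasing confidence, which is what
    thresholding a confidence score does (ties are ignored in this sketch).
    """
    total = len(confidences)
    order = sorted(range(total), key=lambda i: confidences[i], reverse=True)

    best_coverage = 0.0
    error_sum = 0.0
    for answered, index in enumerate(order, start=1):
        error_sum += 1.0 - accuracies[index]
        if error_sum / answered <= max_risk:
            best_coverage = answered / total
    return best_coverage


# Toy inputs (made up): the third most confident answer is wrong.
toy_confidences = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_accuracies = [1.0, 1.0, 0.0, 1.0, 1.0]
print(coverage_at_risk(toy_confidences, toy_accuracies, max_risk=0.10))
```

With these toy values only the two most confident answers can be kept at a 10% tolerance, because a single confident mistake caps how far the threshold can be lowered. At tolerances near 1%, the ranking quality of the confidence score matters even more for the same reason.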

Figures

Figures reproduced from arXiv: 2505.09591 by Marcus Rohrbach, Mohammad Emtiyaz Khan, Nathalie Daun, Tobias Jan Wieczorek.

Figure 1: Despite recent performance gains, VLMs trained with popular optimizers like AdamW do not ...
Figure 2: Overview of the methods we experiment with and their selectors. Variational VQA employs ...
Figure 3: Accuracy, calibration and selective prediction results for different models after fine-tuning on ...
Figure 4: Comparison of Variational VQA to MC Dropout, which uses the same inference compute, on ...
Figure 5: Accuracy, calibration and selective prediction results for different VQAv2/AdVQA mixtures for ...
Figure 6: Qualitative examples on VQAv2 with BEiT-3 large where AdamW is wrong while VarVQA abstains.
Figure 7: Qualitative examples on NLVR2 with BEiT-3 large where AdamW is wrong while VarVQA abstains.
Figure 8: Computational overhead comparison (IVON vs AdamW). All runs were executed on NVIDIA A100 ...
Figure 9: Sample ablation and comparison to MC dropout for BEiT-3 large on VQAv2. Lower is better for ...
Figure 10: Sample ablation and comparison to MC dropout for BEiT-3 base on VQAv2. Lower is better for ...
Figure 11: Sample ablation and comparison to MC dropout for ViLT on VQAv2. Lower is better for ECE ...
Figure 12: Sample ablation and comparison to MC Dropout for BEiT-3 large on NLVR2. Lower is better ...
Figure 13: Sample ablation and comparison to MC dropout for BEiT-3 base on NLVR2. Lower is better for ...
Figure 14: Performance on different ID/OOD (VQAv2/AdVQA) fractions for BEiT-3 large. In ...
Figure 15: Performance on different ID/OOD (VQAv2/AdVQA) fractions for BEiT-3 base. In ...
Figure 16: Performance on different ID/OOD (VQAv2/AdVQA) fractions for ViLT. In ...
Figure 17: Qualitative examples on VQAv2 with BEiT-3 large where AdamW abstains while VarVQA is ...
Figure 18: Qualitative examples on NLVR2 with BEiT-3 large where AdamW abstains while VarVQA is ...
Figure 19: Qualitative examples on AdVQA with BEiT-3 large where AdamW is wrong while VarVQA ...
Figure 20: Failure cases on VQAv2 with BEiT-3 large where AdamW abstains while VarVQA is wrong. The ...
Figure 21: Failure cases on NLVR2 with BEiT-3 large where AdamW abstains while VarVQA is wrong. The ...

Original abstract

Despite remarkable progress in recent years, Vision Language Models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models predict selectively, that is, models respond only when they are sufficiently confident. Unfortunately, such approaches can be costly and ineffective for large models, and there exists little evidence to show otherwise for multimodal applications. Here, we show for the first time the effectiveness and competitive edge of variational Bayes for selective prediction in VQA. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low (≤1%). Often, just one posterior sample yields more reliable answers than those given by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Variational VQA, an extension of variational Bayesian methods to vision-language models for uncertainty-aware selective prediction on VQA and visual reasoning tasks. It claims that variational inference improves calibration, enables reliable selective prediction (especially at error tolerances ≤1%), and that a single posterior sample often outperforms AdamW-trained models. A new risk-averse selector based on prediction variance is proposed and shown to outperform standard posterior averaging.

Significance. If the central experimental claims hold, the work would be significant for demonstrating that variational Bayes can be made tractable and effective for selective prediction in large multimodal models, offering a practical route to reduce overconfidence and hallucinations in VLMs. This could influence reliability-focused research in vision-language systems by providing evidence that uncertainty-aware methods yield gains beyond standard training and averaging.

major comments (3)
  1. [§5.2, Table 2] The reported improvement of the risk-averse selector over posterior averaging (approximately 1–2% absolute on VQA accuracy at 1% error tolerance) is not compared against a non-variational baseline trained to the same total compute budget or against simple averaging with an equal number of forward passes; without this, it is unclear whether the variance term provides gains beyond what posterior averaging already achieves.
  2. [§4.1, Eq. (8)–(10)] The variational posterior parameterization for the full-scale VLM is described at a high level but lacks explicit details on the number of additional parameters introduced by the variational distribution or the memory/compute overhead relative to standard fine-tuning; this information is load-bearing for the tractability claim when scaling to large multimodal transformers.
  3. [§5.1, Figure 3] The claim that one posterior sample yields more reliable answers than AdamW models is supported by point estimates but lacks error bars across random seeds or statistical tests; given the known sensitivity of VQA metrics to initialization, this weakens the assertion that variational sampling is reliably superior at low sample counts.
minor comments (2)
  1. [Abstract] The abstract states performance claims without referencing the specific datasets or metrics used; the introduction or experimental section should explicitly list the VQA and visual reasoning benchmarks together with their statistics.
  2. [§3.3] Notation for the risk-averse selector (variance of predictions) is introduced in §3.3 but not contrasted with existing uncertainty-aware abstention methods in the related-work section; adding 1–2 sentences would improve context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§5.2, Table 2] the reported improvement of the risk-averse selector over posterior averaging (approximately 1–2% absolute on VQA accuracy at 1% error tolerance) is not compared against a non-variational baseline trained to the same total compute budget or against simple averaging with an equal number of forward passes

    Authors: We agree that direct comparisons to non-variational baselines under matched compute and to simple averaging with an equal number of forward passes would strengthen the evidence that the variance term adds value beyond posterior averaging. In the revised manuscript we will add these baselines, trained to the same total compute budget, and report the corresponding selective prediction results at low error tolerances. revision: yes

  2. Referee: [§4.1, Eq. (8)–(10)] the variational posterior parameterization for the full-scale VLM is described at a high level but lacks explicit details on the number of additional parameters introduced by the variational distribution or the memory/compute overhead relative to standard fine-tuning

    Authors: We acknowledge that explicit quantification of the additional parameters and overhead is necessary to substantiate the tractability claim. In the revised version we will provide the exact number of extra parameters introduced by the variational distribution, together with a breakdown of memory and compute overhead relative to standard AdamW fine-tuning of the same VLM backbone. revision: yes

  3. Referee: [§5.1, Figure 3] the claim that one posterior sample yields more reliable answers than AdamW models is supported by point estimates but lacks error bars across random seeds or statistical tests

    Authors: We agree that error bars and statistical tests are important given the known sensitivity of VQA metrics to initialization. We will recompute the results in Figure 3 across multiple random seeds, add error bars, and include statistical significance tests in the revised manuscript to support the claim that a single posterior sample is reliably superior at low sample counts. revision: yes

Circularity Check

0 steps flagged

No circularity: method extends standard variational inference to VQA with independent empirical claims

The paper introduces Variational VQA as an application of existing variational deep learning techniques to selective prediction in VQA and visual reasoning. Claims of improved calibration, gains from a variance-based selector over averaging, and superiority of single posterior samples over AdamW are presented as empirical findings rather than derivations that reduce to inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central result are evident from the provided text. The claims are checked against external benchmarks through the reported experiments, not against quantities the method itself defines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; the approach extends prior variational methods whose specific assumptions are not stated here.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    VQA: Visual Question Answering

    Published in Transactions on Machine Learning Research (04/2026). Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV),

  2. [2]

    Improving LoRA with Variational Learning

    Bai Cong, Nico Daheim, Yuesong Shen, Rio Yokota, Mohammad Emtiyaz Khan, and Thomas Möllenhoff. Improving LoRA with Variational Learning. arXiv preprint arXiv:2506.14280,

  3. [3]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. Why Language Models Hallucinate. arXiv preprint arXiv:2509.04664,

  4. [4]

    Variational Learning is Effective for Large Deep Networks

    Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Clement Bazan, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, and Thomas Möllenhoff. Variational Learning is Effective for Large Deep Networks. In International Conference on Machine Learning (ICML),

  5. [5]

    Overpruning in Variational Bayesian Neural Networks

    Brian Trippe and Richard Turner. Overpruning in Variational Bayesian Neural Networks. arXiv preprint arXiv:1801.06230,

  6. [6]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191,

  7. [7]

    • Section B: Training and inference time differences between AdamW and VarVQA

    Supplement • Section A: Hyperparameters for training and inference. • Section B: Training and inference time differences between AdamW and VarVQA. • Section C: The impact of common calibration methods on the baseline and on VarVQA. • Section D: Extended results from the main paper. • Section E...

  8. [8]

    • Section G: Measuring threshold generalization (Section 4.2, see Coverage at Risk), i.e., how close the test risk is to the target, given a validation-selected abstention threshold

    to our VarVQA method and the baselines. • Section G: Measuring threshold generalization (Section 4.2, see Coverage at Risk), i.e., how close the test risk is to the target, given a validation-selected abstention threshold. • Section H: Comparing VarVQA to using a task-specific selector head (Whitehead et al., 2022), which requires an additional training phase. A ...

  9. [9]

    • BEiT-3 large is trained in mixed precision (bf16)

    • As the default ViLT implementation has dropout = 0, we performed a hyperparameter search to find the optimal lr-dropout combination, which resulted in a slightly lower learning rate than the default (3·10⁻⁵ vs. 10⁻⁴). • BEiT-3 large is trained in mixed precision (bf16). • Modest gradient clipping is added for all models. IVON Training. We generally follow Sh...

  10. [10]

    • IVON needs gradient clipping for stability, as with no clipping, the Hessian estimate will frequently diverge

    Our high-level findings and guidelines are as follows. • IVON needs gradient clipping for stability, as with no clipping, the Hessian estimate will frequently diverge. • The gradient clipping for IVON needs to be slightly higher than that of AdamW, as AdamW typically produces smaller gradients. ...

  11. [11]

    Peak GPU memory when training on VQAv2 is slightly higher for BEiT-3 base (Fig. 8a) and moderately higher for BEiT-3 large (Fig. 8b)

    Peak GPU memory when training on VQAv2 is slightly higher for BEiT-3 base (Fig. 8a) and moderately higher for BEiT-3 large (Fig. 8b), indicating that more work might be needed to improve the efficiency of IVON on large models. The training overhead also increases for larger models, from just 0.8% longer training with ViLT to 4.2% with BEiT-3 large, partially...

  12. [12]

    on top of our trained models. For VQA, the models we investigate use sigmoids in the output layer; therefore, temperature scaling cannot change relative confidence rankings (due to the strict monotonicity of the sigmoid). We thus use vector scaling and train a linear layer to learn the parameters, following Whitehead et al. (2022). For NLVR2, the binary ...

  13. [13]

    That, in turn, is due to the way that labels are inferred from the answers of 10 annotators, cf. Section 4.2. [...] NLVR2 (cf. Tab. 2), the additional step of temperature scaling reverses the order, i.e., VarVQA + temperature scaling achieves a lower ECE than AdamW Dropout + temperature scaling. As temperature s...

  14. [14]

    • Variational VQA reduces miscalibration in terms of the Expected Calibration Error (ECE)

    The findings of the main paper for BEiT-3 large hold true across BEiT-3 base and ViLT, namely: • Variational VQA is as effective as AdamW for training large multimodal models - it matches or sometimes even surpasses the accuracy obtained with AdamW. • Variational VQA reduces miscalibration ...

  15. [15]

    and NLVR2 (Fig. 21). For all examples, we set the abstention threshold γ by optimizing Φ100 on ID validation data. We always pick examples where the answers of AdamW and VarVQA are identical and where the gap in their confidence is largest. Interestingly, a large number of examples where VarVQA performs better on NLVR2 seem to be related to counting, we lea...

  16. [16]

    The results for VQAv2 and NLVR2 are shown in Table 11 and Table 12, respectively

    on top of all other previously explored confidence estimation methods, as training multiple models constitutes an orthogonal direction of investigation. The results for VQAv2 and NLVR2 are shown in Table 11 and Table 12, respectively. On VQAv2, which is the much larger and thus much more robust dataset in terms of the sensitive selective prediction metrics, VarVQA combin...