pith. machine review for the scientific record.

arxiv: 2604.09529 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords large vision-language models · confidence calibration · reinforcement learning · visual reasoning · hallucination reduction · decoupled confidence · multimodal models · image perturbation

The pith

A reinforcement learning method decouples visual and reasoning confidence to calibrate large vision-language models without perception labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models often produce confident but incorrect answers due to either faulty visual perception or flawed reasoning on correct perceptions. Existing calibration approaches treat confidence as a single score, which mixes these error sources and fails to address visual uncertainty properly. VL-Calibration introduces a framework that separates these two types of confidence and supervises the visual part using measures derived from perturbing the image and checking token predictability. If successful, this allows the model to reduce hallucinations stemming from poor visual grounding while maintaining good reasoning where perception is solid. Readers should care because it targets a key barrier to deploying these models in reliable, high-stakes settings.

Core claim

The paper establishes that a reinforcement learning framework called VL-Calibration can explicitly decouple confidence into visual and reasoning components. Visual confidence is supervised without ground-truth perception labels by an intrinsic estimation that combines KL-divergence under image perturbations for visual grounding and token entropy for internal certainty. Token-level advantage reweighting then focuses the optimization on tokens according to this visual certainty, which suppresses ungrounded hallucinations while preserving valid perception.
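To make the proxy concrete, here is a minimal sketch of how such a per-token estimate could be computed, assuming access to the model's token logits under the original and a perturbed (e.g., partially masked) image. The squashing functions, the mixing weight alpha, and every name below are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def visual_certainty(logits_orig, logits_pert, alpha=0.5):
        # logits_orig, logits_pert: (T, V) per-token logits given the original
        # image and a perturbed copy of it.
        log_p = F.log_softmax(logits_orig, dim=-1)
        log_q = F.log_softmax(logits_pert, dim=-1)
        # (i) Visual grounding: KL(p || q). A token whose distribution shifts
        # sharply when the image is perturbed genuinely depends on the image.
        kl = torch.sum(log_p.exp() * (log_p - log_q), dim=-1)
        grounding = 1.0 - torch.exp(-kl)  # squash [0, inf) into [0, 1)
        # (ii) Internal certainty: one minus normalized token entropy.
        entropy = -torch.sum(log_p.exp() * log_p, dim=-1)
        max_entropy = torch.log(torch.tensor(float(logits_orig.shape[-1])))
        certainty = 1.0 - entropy / max_entropy
        # Blend the two signals; alpha is a hypothetical hyperparameter.
        return alpha * grounding + (1.0 - alpha) * certainty  # (T,) in [0, 1]

High scores mark tokens that are both image-dependent and confidently predicted; low scores flag candidates for the token-level advantage reweighting described above.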

What carries the argument

The reinforcement learning framework that decouples visual and reasoning confidence, supervised by an intrinsic visual certainty measure from KL-divergence on perturbed images combined with token entropy.

If this is right

  • Improves calibration on thirteen benchmarks for visual reasoning tasks.
  • Boosts visual reasoning accuracy in addition to better calibration.
  • Generalizes effectively to out-of-distribution benchmarks.
  • Applies across various model scales and architectures.
  • Suppresses hallucinations arising from ungrounded visual inputs via token reweighting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach of using perturbation-based proxies for certainty could extend to calibrating other types of multimodal models, like those processing video or audio inputs.
  • Separating error sources in this way might enable more targeted fine-tuning strategies for perception and reasoning modules independently.
  • If the visual certainty proxy holds up, it could reduce the need for expensive labeled data in multimodal calibration tasks.
  • The method highlights that visual uncertainty often dominates in these models, suggesting focus on perception improvements could yield broader gains in reliability.

Load-bearing premise

The combination of KL-divergence under image perturbations and token entropy provides a reliable proxy for visual certainty without ground-truth perception labels.

What would settle it

Observing whether the estimated visual certainty correlates with actual perception errors on a held-out dataset with known visual failures; poor correlation or no reduction in perception-related errors after applying the method would falsify the approach.
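As a hedged sketch of that test: given per-example certainty scores from any such estimator and binary human judgments of perception correctness on a held-out set, a rank correlation settles the question in either direction. The function and variable names are placeholders, not an evaluation the paper reports.

    import numpy as np
    from scipy.stats import spearmanr

    def proxy_validity(certainty_scores, perception_correct):
        # certainty_scores: (N,) estimated visual certainty per example.
        # perception_correct: (N,) 1 if annotators judged perception correct.
        # A strongly positive rho supports the proxy; rho near zero (or no
        # downstream reduction in perception errors) would falsify it.
        rho, pval = spearmanr(certainty_scores, perception_correct)
        return rho, pval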

Figures

Figures reproduced from arXiv: 2604.09529 by Leilei Gan, Wenyi Xiao, Xinchi Xu.

Figure 1
Figure 1: The baseline method (upper) makes overcon…
Figure 2
Figure 2: Overview of our framework. A) Decoupled Confidence Inference. The LVLM explicitly outputs separate visual and reasoning confidence to derive a holistic confidence. B) Intrinsic Visual Certainty Estimation. We quantify visual certainty by measuring visual grounding and internal certainty. C) GRPO Training. We align visual confidence with the visual certainty score, and the holistic confidence with the answer a…
Figure 3
Figure 3: Effectiveness of Visual Certainty Estimation. Our estimation outperforms the strongest baseline, Self-Certainty, at mask ratios > 0.65. The vertical dashed line marks the mask ratio (0.8) adopted in the following experiments. [plot residue: spilled reliability-diagram panels reporting ECE 0.421 / Acc 51.6% (Base Model) and ECE 0.167 / Acc 70.4% (RLCR)]
Figure 4
Figure 4: Comparison with Holistic Confidence Calibration. Reliability diagrams comparing the Base Model (Qwen3-VL-4B, left), RLCR (middle), and Ours (right) across all evaluation datasets. [extract spills into body text: 1,500 dense captions generated by Qwen3-VL-4B, judged by Gemini-3-pro-preview for correctness and quality]
Figure 6
Figure 6: Heatmap of visual vs. reasoning confidence.
Figure 7
Figure 7: Visualization of the most visually uncertain tokens. Darker red represents higher uncertainty.
Figure 8
Figure 8: Entropy curves of different estimations: KL…
Figure 9
Figure 9: Confidence of overconfident wrong answers. [bar-chart residue: categories Visual Ambiguity, Reasoning Failure, Visual Overload, with paired Vision Confidence and Reasoning Confidence bars]
Figure 10
Figure 11
Figure 11: Detailed reliability diagrams of VL-Calibration-4B across the evaluation datasets.
Figure 12
Figure 12: Detailed reliability diagrams of VL-Calibration-8B across the evaluation datasets.
Figure 13
Figure 13: Confidence Distribution across Visually Answerable and Unanswerable Problems. We visualize the distribution of confidence scores for the Base Model, RLCR, and Our Method (columns) on both Answerable and Unanswerable datasets (rows). While baselines tend to remain overconfident even on unanswerable queries (bottom row), Our Method exhibits a significant distributional shift towards lower confidence, demons…
Figure 14
Figure 14: Training dynamics of Qwen3-VL-4B (upper) and Qwen3-VL-8B (bottom). We visualize the ACC,…
Figure 15
Figure 15: Baseline performance comparison with Qwen3-VL-4B in terms of Accuracy, ECE, and AUROC.
Figure 16
Figure 16: Baseline performance comparison with Qwen3-VL-8B in terms of Accuracy, ECE, and AUROC.
Figure 17
Figure 17: Case Study of the Qwen3-VL-4B model trained with our method. Figure (a) (upper) showcases a correct…
read the original abstract

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
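On the inference side, the abstract (and Figure 2) has the model verbalize separate visual and reasoning confidences, from which a holistic confidence is derived. The response format, parser, and product combination rule below, which treats perception and reasoning as independent failure points, are assumptions for illustration rather than the paper's stated derivation.

    import re

    def holistic_confidence(response: str):
        # Hypothetical verbalized format: "... Visual confidence: 0.9 ...
        # Reasoning confidence: 0.7 ...". The product rule assumes the two
        # failure modes are independent; the paper's rule may differ.
        vis = re.search(r"Visual confidence:\s*([01](?:\.\d+)?)", response)
        rea = re.search(r"Reasoning confidence:\s*([01](?:\.\d+)?)", response)
        if vis is None or rea is None:
            return None
        return float(vis.group(1)) * float(rea.group(1))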

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes VL-Calibration, a reinforcement learning framework for LVLMs that decouples confidence into separate visual and reasoning components. Visual confidence is supervised via an intrinsic estimator combining KL-divergence under image perturbations with token entropy (without ground-truth perception labels), paired with token-level advantage reweighting to suppress ungrounded hallucinations. Experiments on thirteen benchmarks report improved calibration and visual reasoning accuracy, plus OOD generalization across model scales and architectures.

Significance. If the results hold, the work addresses a key limitation of existing calibration methods for multimodal models by distinguishing perceptual from reasoning errors, which could improve reliability in high-stakes applications. The label-free proxy and RL-based decoupling represent a practical advance over holistic confidence scoring, with the reported cross-scale and OOD generalization adding to potential impact.

major comments (1)
  1. [Abstract (intrinsic visual certainty estimation)] The central claim that VL-Calibration improves calibration and accuracy via decoupling rests on the intrinsic visual certainty estimation (KL-divergence under image perturbations plus token entropy) cleanly isolating visual uncertainty from language priors or reasoning effects. The manuscript must provide targeted validation—such as correlation with human-annotated perception accuracy on controlled datasets, ablations isolating visual vs. linguistic perturbations, or comparison against ground-truth perception labels—to confirm the proxy does not conflate non-visual factors. Absent this, observed gains cannot be attributed to the proposed separation.
minor comments (3)
  1. [Abstract] The abstract states positive results on thirteen benchmarks and OOD generalization but omits quantitative metrics, effect sizes, error bars, or baseline comparisons; the full experimental section should include these details for reproducibility and assessment of practical significance.
  2. [Methods] Clarify the precise RL objective, advantage computation, and token reweighting formula (including any hyperparameters) with equations or pseudocode to allow independent implementation; a speculative sketch of one possible reading follows this list.
  3. [Experiments] List all thirteen benchmarks explicitly with citations and note any preprocessing or evaluation protocols used for calibration metrics (e.g., ECE).
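To illustrate what minor comment 2 is asking for, here is a speculative sketch of one plausible token-level reweighting, assuming a GRPO-style sequence advantage broadcast to tokens. The weighting functions and beta are invented for illustration and are not taken from the paper.

    import torch

    def reweight_advantages(advantages, visual_certainty, beta=1.0):
        # advantages: (T,) sequence-level GRPO advantage broadcast per token.
        # visual_certainty: (T,) per-token score in [0, 1].
        # beta: assumed reweighting strength (hypothetical hyperparameter).
        # Penalize ungrounded tokens harder when the advantage is negative;
        # reinforce grounded tokens when it is positive.
        w_pos = 1.0 + beta * visual_certainty
        w_neg = 1.0 + beta * (1.0 - visual_certainty)
        weights = torch.where(advantages >= 0, w_pos, w_neg)
        return advantages * weights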

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We address the major comment point by point below, providing our honest assessment and plans for revision.

read point-by-point responses
  1. Referee: The central claim that VL-Calibration improves calibration and accuracy via decoupling rests on the intrinsic visual certainty estimation (KL-divergence under image perturbations plus token entropy) cleanly isolating visual uncertainty from language priors or reasoning effects. The manuscript must provide targeted validation—such as correlation with human-annotated perception accuracy on controlled datasets, ablations isolating visual vs. linguistic perturbations, or comparison against ground-truth perception labels—to confirm the proxy does not conflate non-visual factors. Absent this, observed gains cannot be attributed to the proposed separation.

    Authors: We agree that stronger direct evidence is needed to attribute performance gains specifically to the visual-reasoning decoupling rather than other factors. The manuscript motivates the intrinsic estimator as a label-free proxy: KL-divergence under image perturbations quantifies sensitivity to visual changes (visual grounding), while token entropy captures internal model certainty, both independent of downstream reasoning chains or language priors. The RL objective with token-level advantage reweighting then uses this to suppress ungrounded tokens. Experiments across thirteen benchmarks, including OOD generalization and cross-scale/architecture results, show consistent gains in calibration and visual reasoning accuracy. To address the concern, we will revise the manuscript to include (i) new ablations that isolate visual perturbations from linguistic or reasoning-related factors and (ii) expanded discussion with qualitative examples linking the proxy to perceptual failures. A direct correlation analysis against human-annotated perception labels or ground-truth perception comparisons is not present in the current work, as the method is intentionally designed to avoid requiring such labels; we will note this limitation explicitly and discuss why the downstream hallucination suppression and generalization results provide supporting (if indirect) evidence for the separation. revision: partial

Circularity Check

0 steps flagged

No circularity: novel RL decoupling and intrinsic estimator are introduced as new components, validated empirically rather than derived by construction

full rationale

The paper's core contribution is a new reinforcement learning framework (VL-Calibration) that explicitly decouples visual and reasoning confidence, supervised by a label-free intrinsic visual certainty estimator (KL-divergence under perturbations plus token entropy) and token-level advantage reweighting. These are presented as methodological innovations, not as mathematical derivations or predictions that reduce to fitted inputs or prior self-citations by construction. Claims of improved calibration and accuracy rest on experiments across thirteen benchmarks and generalization tests, which are external to the method definition itself. No equations or steps in the abstract or described chain equate outputs to inputs tautologically, and the reader's assessment of independent content is consistent with the absence of self-definitional, fitted-prediction, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the assumption that intrinsic visual certainty can be estimated without ground-truth labels and that token-level reweighting in RL will suppress hallucinations while preserving valid perception. No explicit free parameters, axioms, or invented entities are detailed beyond the new method components.

axioms (1)
  • domain assumption — Reinforcement learning can optimize decoupled confidence scores when supervised by intrinsic visual certainty signals.
    Invoked to justify the overall training framework.
invented entities (1)
  • intrinsic visual certainty estimation — no independent evidence
    purpose: Supervise visual confidence without ground-truth perception labels by combining KL-divergence under perturbations and token entropy.
    New construct introduced to enable the decoupling; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5515 in / 1318 out tokens · 29302 ms · 2026-05-10T17:21:39.055054+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...

Reference graph

Works this paper leans on

11 extracted references · 6 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    In The Thirty-ninth Annual Conference on Neural Information Processing Systems

    Conftuner: Training large language models to express their confidence verbally. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. 2023. Super-clevr: A virtual benchmark to diagnose domain robustness in visual r…

  2. [2]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. arXiv preprint arXiv:2503.07365. Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma Gongque, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang

  3. [3]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

    We-math: Does your large multimodal model achieve human-like mathematical reasoning? CoRR, abs/2407.01284. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn

  4. [4]

    Proximal Policy Optimization Algorithms

    Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741. Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. 2018. Overcoming language priors in visual question answering with adversarial regularization. Advances in Neural Information Processing Systems, 31. …

  5. [5]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Lacie: Listener-aware finetuning for calibration in large language models. Advances in Neural Information Processing Systems, 37:43080–43106. Roman Vashurin, Maiya Goloburda, Albina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, and Maxim Panov. 2025. Cocoa: A minimum bayes risk framework bridging confidence and consistency for unce…

  6. [6]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Measuring multimodal mathematical reasoning with math-vision dataset. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025b. Internvl3.5: Advancing open-sour…

  7. [7]

    In The Thirty-ninth Annual Conference on Neural Information Processing Systems

    UFO-RL: Uncertainty-focused optimization for efficient reinforcement learning data selection. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. Ziang Zhou, Tianyuan Jin, Jieming Shi, and Li Qing

  8. [8]

    VL-Calibration: Decoupled Verbalized Confidence for Large Vision-Language Models Reasoning

    Steerconf: Steering LLMs for confidence elicitation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. 2024. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. This Appendix for "VL-Calibrat…

  9. [9]

    Accuracy: A measure of reasoning performance

  10. [10]

    $\mathrm{AUROC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\,dt$ (13), where TPR is the True Positive Rate and FPR is the False Positive Rate

    Area Under the Receiver Operating Characteristic Curve (AUROC): Measures the calibration ability of a classifier to distinguish between positive/negative classes across thresholds. $\mathrm{AUROC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\,dt$ (13), where TPR is the True Positive Rate and FPR is the False Positive Rate

  11. [11]

    Do MLLMs truly see the diagrams?

    Expected Calibration Error (ECE): Calibration metric that groups confidences into bins and computes the difference between average correctness and average confidence. $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|$ (14), where M is the number of bins, B_m is the set of samples in bin m, and N is the number of samples. We use M = 10. A.2.2 Evaluation Datasets This section pr…
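
For concreteness, a small sketch of the ECE computation as defined above, with M = 10 equal-width bins; the equal-width binning matches the stated definition, though the paper's exact evaluation protocol may differ in detail.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # ECE = sum_m (|B_m| / N) * |acc(B_m) - conf(B_m)| over M bins.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        # Assign each confidence in [0, 1] to one of n_bins equal-width bins.
        idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
        n = len(confidences)
        ece = 0.0
        for m in range(n_bins):
            mask = idx == m
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += (mask.sum() / n) * gap
        return ece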