VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Pith reviewed 2026-05-10 17:21 UTC · model grok-4.3
The pith
A reinforcement learning method decouples visual and reasoning confidence to calibrate large vision-language models without perception labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a reinforcement learning framework called VL-Calibration can explicitly decouple confidence into visual and reasoning components. Visual confidence is supervised, without ground-truth perception labels, by an intrinsic certainty estimator that combines KL-divergence under image perturbations (visual grounding) with token entropy (internal certainty). Token-level advantage reweighting then focuses optimization on tokens according to this visual certainty, suppressing ungrounded hallucinations while preserving valid perception.
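The abstract names the ingredients of the estimator but not its formula. As a minimal sketch of one plausible instantiation (the combination rule, the `alpha` weight, and the exp() squashing are illustrative assumptions, not taken from the paper), assuming per-token next-token distributions under the original and a perturbed image:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two token distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def token_entropy(p, eps=1e-12):
    """Shannon entropy of a token distribution, in nats."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def visual_certainty(p_orig, p_perturbed, alpha=0.5):
    """Illustrative combination of the two signals.

    High KL under image perturbation means the token's prediction
    depends on the image (well grounded); high entropy means the
    model is internally unsure. Grounded, low-entropy tokens score
    near 1; image-insensitive tokens score near 0.
    """
    grounding = kl_divergence(p_orig, p_perturbed)
    uncertainty = token_entropy(p_orig)
    return float(np.exp(-alpha * uncertainty) * (1.0 - np.exp(-grounding)))
```

Any monotone combination with the same two inputs would serve the argument equally well; the point is only that both signals are computable from the model itself, with no perception labels.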
What carries the argument
The reinforcement learning framework that decouples visual and reasoning confidence, supervised by an intrinsic visual certainty measure from KL-divergence on perturbed images combined with token entropy.
If this is right
- Improves calibration on thirteen benchmarks for visual reasoning tasks.
- Boosts visual reasoning accuracy in addition to better calibration.
- Generalizes effectively to out-of-distribution benchmarks.
- Applies across various model scales and architectures.
- Suppresses hallucinations arising from ungrounded visual inputs via token reweighting.
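The abstract names token-level advantage reweighting but not its form. A minimal sketch, assuming a PPO/GRPO-style per-token advantage simply scaled by the visual certainty score (the multiplicative rule and the `gamma` exponent are hypothetical choices, not the paper's formula):

```python
import numpy as np

def reweight_advantages(advantages, visual_certainty, gamma=1.0):
    """Scale per-token advantages by visual certainty.

    Tokens with low visual certainty (likely ungrounded) have their
    advantage down-weighted, so the policy update neither reinforces
    nor strongly penalizes them; well-grounded tokens keep close to
    their full advantage.
    """
    advantages = np.asarray(advantages, dtype=float)
    weights = np.asarray(visual_certainty, dtype=float) ** gamma
    return advantages * weights
```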
Where Pith is reading between the lines
- This approach of using perturbation-based proxies for certainty could extend to calibrating other types of multimodal models, like those processing video or audio inputs.
- Separating error sources in this way might enable more targeted fine-tuning strategies for perception and reasoning modules independently.
- If the visual certainty proxy holds up, it could reduce the need for expensive labeled data in multimodal calibration tasks.
- The method highlights that visual uncertainty often dominates in these models, suggesting focus on perception improvements could yield broader gains in reliability.
Load-bearing premise
The combination of KL-divergence under image perturbations and token entropy provides a reliable proxy for visual certainty without ground-truth perception labels.
What would settle it
Observing whether the estimated visual certainty correlates with actual perception errors on a held-out dataset with known visual failures; poor correlation or no reduction in perception-related errors after applying the method would falsify the approach.
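The proposed test reduces to a correlation between the estimated certainty and binary perception correctness on a held-out set. A minimal sketch (a plain Pearson correlation; the paper does not specify a statistic):

```python
import numpy as np

def certainty_error_correlation(certainty, perception_correct):
    """Pearson correlation between estimated visual certainty and
    binary perception correctness (1 = correct, 0 = failure).

    A clearly positive value on data with known visual failures
    would support the proxy; near-zero or negative would undercut it.
    """
    c = np.asarray(certainty, dtype=float)
    y = np.asarray(perception_correct, dtype=float)
    return float(np.corrcoef(c, y)[0, 1])
```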
Original abstract
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VL-Calibration, a reinforcement learning framework for LVLMs that decouples confidence into separate visual and reasoning components. Visual confidence is supervised via an intrinsic estimator combining KL-divergence under image perturbations with token entropy (without ground-truth perception labels), paired with token-level advantage reweighting to suppress ungrounded hallucinations. Experiments on thirteen benchmarks report improved calibration and visual reasoning accuracy, plus OOD generalization across model scales and architectures.
Significance. If the results hold, the work addresses a key limitation of existing calibration methods for multimodal models by distinguishing perceptual from reasoning errors, which could improve reliability in high-stakes applications. The label-free proxy and RL-based decoupling represent a practical advance over holistic confidence scoring, with the reported cross-scale and OOD generalization adding to potential impact.
major comments (1)
- [Abstract (intrinsic visual certainty estimation)] The central claim that VL-Calibration improves calibration and accuracy via decoupling rests on the intrinsic visual certainty estimation (KL-divergence under image perturbations plus token entropy) cleanly isolating visual uncertainty from language priors or reasoning effects. The manuscript must provide targeted validation—such as correlation with human-annotated perception accuracy on controlled datasets, ablations isolating visual vs. linguistic perturbations, or comparison against ground-truth perception labels—to confirm the proxy does not conflate non-visual factors. Absent this, observed gains cannot be attributed to the proposed separation.
minor comments (3)
- [Abstract] The abstract states positive results on thirteen benchmarks and OOD generalization but omits quantitative metrics, effect sizes, error bars, or baseline comparisons; the full experimental section should include these details for reproducibility and assessment of practical significance.
- [Methods] Clarify the precise RL objective, advantage computation, and token reweighting formula (including any hyperparameters) with equations or pseudocode to allow independent implementation.
- [Experiments] List all thirteen benchmarks explicitly with citations and note any preprocessing or evaluation protocols used for calibration metrics (e.g., ECE).
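For reference, the standard ECE the comment above asks to be pinned down is binned as ECE = Σ_m (|B_m|/N)·|acc(B_m) − conf(B_m)|. A generic implementation with M equal-width bins (not the paper's code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()      # average correctness in bin
        conf = confidences[mask].mean()  # average confidence in bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return float(ece)
```

Reporting the bin count (commonly M = 10) and whether bins are equal-width or equal-mass is exactly the kind of protocol detail the comment requests.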
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work. We address the major comment point by point below, providing our honest assessment and plans for revision.
Point-by-point responses
Referee: The central claim that VL-Calibration improves calibration and accuracy via decoupling rests on the intrinsic visual certainty estimation (KL-divergence under image perturbations plus token entropy) cleanly isolating visual uncertainty from language priors or reasoning effects. The manuscript must provide targeted validation—such as correlation with human-annotated perception accuracy on controlled datasets, ablations isolating visual vs. linguistic perturbations, or comparison against ground-truth perception labels—to confirm the proxy does not conflate non-visual factors. Absent this, observed gains cannot be attributed to the proposed separation.
Authors: We agree that stronger direct evidence is needed to attribute performance gains specifically to the visual-reasoning decoupling rather than other factors. The manuscript motivates the intrinsic estimator as a label-free proxy: KL-divergence under image perturbations quantifies sensitivity to visual changes (visual grounding), while token entropy captures internal model certainty, both independent of downstream reasoning chains or language priors. The RL objective with token-level advantage reweighting then uses this to suppress ungrounded tokens. Experiments across thirteen benchmarks, including OOD generalization and cross-scale/architecture results, show consistent gains in calibration and visual reasoning accuracy. To address the concern, we will revise the manuscript to include (i) new ablations that isolate visual perturbations from linguistic or reasoning-related factors and (ii) expanded discussion with qualitative examples linking the proxy to perceptual failures. A direct correlation analysis against human-annotated perception labels or ground-truth perception comparisons is not present in the current work, as the method is intentionally designed to avoid requiring such labels; we will note this limitation explicitly and discuss why the downstream hallucination suppression and generalization results provide supporting (if indirect) evidence for the separation.
Revision: partial
Circularity Check
No circularity: novel RL decoupling and intrinsic estimator are introduced as new components, validated empirically rather than derived by construction
full rationale
The paper's core contribution is a new reinforcement learning framework (VL-Calibration) that explicitly decouples visual and reasoning confidence, supervised by a label-free intrinsic visual certainty estimator (KL-divergence under perturbations plus token entropy) and token-level advantage reweighting. These are presented as methodological innovations, not as mathematical derivations or predictions that reduce to fitted inputs or prior self-citations by construction. Claims of improved calibration and accuracy rest on experiments across thirteen benchmarks and generalization tests, which are external to the method definition itself. No equations or steps in the abstract or described chain equate outputs to inputs tautologically, and the reader's assessment of independent content is consistent with the absence of self-definitional, fitted-prediction, or load-bearing self-citation patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reinforcement learning can optimize decoupled confidence scores when supervised by intrinsic visual certainty signals.
invented entities (1)
- intrinsic visual certainty estimation (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking: BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...