arxiv: 2605.16999 · v1 · pith:FPJIHIJKnew · submitted 2026-05-16 · 💻 cs.LG

Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning

Pith reviewed 2026-05-19 20:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords calibration · reinforcement learning · multimodal reasoning · vision-language models · ranking loss · confidence estimation · post-training · group-based RL

The pith

Ranking signals from group-based RL can supervise confidence to improve calibration in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning post-training improves reasoning accuracy in vision-language models, but the resulting policies remain poorly calibrated, especially when inputs are corrupted or ambiguous. The paper introduces Ranking-Aware Calibration, a framework that uses two comparison signals already produced during group-based RL training to supervise model confidence without extra labels. One signal enforces higher confidence for better rollouts than worse ones within the same prompt, while the other requires confidence to drop as visual evidence degrades. These losses integrate directly into existing RL post-training and also reinforce task accuracy by teaching the policy to discriminate between reasoning paths. A sympathetic reader would care because reliable confidence estimates are necessary for safe deployment where models must know when they are likely wrong.

Core claim

Ranking-Aware Calibration supervises confidence using ranking signals that group-based RL already produces at no additional cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training.
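
The ranking-aware group loss described above can be illustrated with a pairwise hinge penalty over the rollouts of one prompt. This is a minimal sketch under stated assumptions, not the paper's formulation: the hinge form, the `margin` parameter, and the name `group_ranking_loss` are hypothetical choices made here for illustration.

```python
from itertools import combinations


def group_ranking_loss(rewards, confidences, margin=0.1):
    """Mean hinge penalty over rollout pairs with different rewards.

    For each pair in which one rollout has a strictly higher reward, the
    penalty is max(0, margin - (conf_better - conf_worse)).
    Returns 0.0 when every reward in the group is tied.
    The hinge form and the margin are illustrative assumptions.
    """
    if len(rewards) != len(confidences):
        raise ValueError("rewards and confidences must have equal length")
    penalties = []
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # tied rollouts carry no ranking signal
        better, worse = (i, j) if rewards[i] > rewards[j] else (j, i)
        gap = confidences[better] - confidences[worse]
        penalties.append(max(0.0, margin - gap))
    return sum(penalties) / len(penalties) if penalties else 0.0


rewards = [1.0, 0.0, 1.0, 0.0]       # two correct, two incorrect rollouts
well_ordered = [0.9, 0.2, 0.8, 0.3]  # correct rollouts are more confident
mis_ordered = [0.3, 0.9, 0.2, 0.8]   # incorrect rollouts are more confident
print(group_ranking_loss(rewards, well_ordered))  # 0.0
print(group_ranking_loss(rewards, mis_ordered))   # about 0.7
```

In this toy group, the penalty vanishes once every correct rollout outranks every incorrect one by at least the margin, and it grows with the size of each inversion.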

What carries the argument

Ranking-Aware Calibration (RAC), which applies a ranking-aware group loss and a clean-corrupted pairwise loss to supervise confidence using comparison signals already generated during group-based reinforcement learning.
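
The clean-corrupted pairwise loss can be sketched in the same spirit: for each matched pair, confidence on the corrupted input is penalized unless it falls below confidence on the clean input. The hinge form, the `margin` value, and the function name `clean_corrupted_loss` are assumptions for illustration; the paper's actual definition may differ.

```python
def clean_corrupted_loss(conf_clean, conf_corrupted, margin=0.05):
    """Mean hinge penalty over matched clean/corrupted pairs.

    Each pair contributes max(0, margin - (c_clean - c_corrupted)), so the
    penalty is zero once confidence drops by at least `margin` under
    corruption. The margin and hinge form are illustrative assumptions.
    """
    if len(conf_clean) != len(conf_corrupted):
        raise ValueError("inputs must be matched pair-for-pair")
    if not conf_clean:
        return 0.0
    total = 0.0
    for clean, corrupted in zip(conf_clean, conf_corrupted):
        total += max(0.0, margin - (clean - corrupted))
    return total / len(conf_clean)


# First pair: confidence drops under corruption (no penalty).
# Second pair: confidence rises under corruption (penalized).
print(clean_corrupted_loss([0.9, 0.8], [0.6, 0.85]))  # about 0.05
```

One possible extension, not shown here, is to scale the margin with corruption severity so that heavier degradation demands a larger confidence drop.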

If this is right

  • The policy learns to discriminate between correct and incorrect reasoning paths, improving task accuracy beyond standard correctness rewards.
  • Calibration error decreases under degraded or corrupted visual inputs.
  • The combination of both losses achieves the best calibration across tested backbones while improving accuracy in most settings.
  • No external confidence annotations are needed because the signals come directly from the existing RL process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ranking-based supervision approach could be tested on language-only models to see if it improves calibration without visual components.
  • This method might reduce reliance on separate post-hoc calibration techniques in production RL pipelines.
  • Extending the pairwise loss to other forms of input degradation beyond visual corruption could reveal broader robustness benefits.

Load-bearing premise

The ranking signals already produced by group-based RL directly reflect reasoning quality and can be used to supervise confidence without introducing new biases or requiring external validation.
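
One way to probe this premise is to enumerate which rollout pairs a ranking signal can actually order. The sketch below assumes rewards are per-rollout scalars and uses a hypothetical helper `ordered_pairs`; it shows that with strictly binary rewards every usable pair is a correct-versus-incorrect pair, so graded quality information only enters if the rewards themselves are graded.

```python
from itertools import combinations


def ordered_pairs(rewards):
    """Return (better, worse) index pairs whose rewards differ strictly."""
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] > rewards[j]:
            pairs.append((i, j))
        elif rewards[j] > rewards[i]:
            pairs.append((j, i))
    return pairs


binary = [1, 0, 1, 0, 0]  # 2 correct, 3 incorrect rollouts
pairs = ordered_pairs(binary)
# Every ordered pair contrasts a correct rollout with an incorrect one,
# and the count equals n_correct * n_incorrect.
assert all(binary[b] == 1 and binary[w] == 0 for b, w in pairs)
assert len(pairs) == 2 * 3

# A group that is all correct (or all incorrect) yields no pairs at all.
assert ordered_pairs([1, 1, 1]) == []

# Graded rewards, if available, order more pairs within a group.
assert len(ordered_pairs([1.0, 0.5, 0.0])) == 3
```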

What would settle it

A controlled experiment on the same benchmarks showing that adding the ranking-aware and pairwise losses produces no reduction in calibration error or no accuracy gain compared to standard group-based RL training would falsify the central claim.
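
Such a comparison depends on how calibration error is measured. The sketch below computes two standard metrics, expected calibration error (ECE) with equal-width bins and the Brier score, from per-example confidences and 0/1 correctness labels. The bin count and the binning convention are assumptions; the paper's exact evaluation protocol is not given here.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    Each non-empty bin contributes |mean confidence - accuracy|,
    weighted by the fraction of examples that fall into it.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: 95% confidence, but only 1 of 4 answers correct.
conf = [0.95, 0.95, 0.95, 0.95]
correct = [1, 0, 0, 0]
print(expected_calibration_error(conf, correct))  # about 0.70
print(brier_score(conf, correct))                 # about 0.6775
```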

Figures

Figures reproduced from arXiv: 2605.16999 by Boyao Yang, Jun Zhu, Peng Cui.

Figure 1: Overview of the RAC Calibration Architecture. The framework is structured around three reinforcement signals: (1) the standard outcome verification task reward, (2) the Ranking-Aware Group Loss, which asserts that higher confidence should coordinate with greater answer correctness within the same rollout group, and (3) the Clean–Corrupted Pairwise Loss, which enforces directional confidence attenuation whe…
Figure 2: Qualitative case study on Qwen2.5-VL-7B-Instruct. Before RAC, the model produces an …
Figure 3: Effect of training corruption severity on calibration transfer for Qwen2.5-VL-7B-Instruct. Models trained at mild corruption levels (T0.2, T0.4) maintain consistently lower ECE and Brier scores across all test severities, while those trained at T0.6 and above exhibit uniformly degraded calibration. The severity-transfer curves in (a) separate into two distinct clusters at T0.6, a pattern confirmed by the h…
Figure 4: Hyperparameter sweeps on Qwen2.5-VL-7B-Instruct. Each point shows macro-averaged …
Figure 5: Generative Training Pair Construction. The matched multimodal branches underpinning our method: Branch A (Corrupted) applies randomly sampled visual perturbation (following the ImageNet-C protocol), juxtaposed against the untouched source in Branch B (Clean). This paired structure provides the clean–corrupted comparisons used in the RAC pairwise loss (…
Figure 6: Visual degradation severity ladder from Clean to …
Figure 7: Visual degradation severity ladder from Clean to …
Figure 8: Visual degradation severity ladder from Clean to …
Original abstract

Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training. We instantiate RAC on Qwen2.5-VL and InternVL-3.5 backbones and evaluate on six multimodal reasoning benchmarks under clean and corrupted inputs. Empirical results show that the ranking-aware loss substantially improves task accuracy by teaching the policy to discriminate between better and worse reasoning, while the pairwise corruption loss reduces calibration error under degraded inputs. Their combination achieves the best calibration across all tested backbones while improving accuracy in the majority of settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Ranking-Aware Calibration (RAC) as a training-time framework for post-training vision-language models with group-based RL. It introduces a ranking-aware group loss that requires higher confidence for better-ranked rollouts within the same prompt and a clean-corrupted pairwise loss that attenuates confidence under degraded visual inputs. Both losses leverage signals already produced by group-based RL without external confidence annotations. The method is instantiated on Qwen2.5-VL and InternVL-3.5 and evaluated on six multimodal reasoning benchmarks under clean and corrupted conditions, with the claim that the combination yields the best calibration while improving accuracy in most settings.

Significance. If the empirical gains hold under rigorous controls, the work could meaningfully advance calibration in multimodal RL systems by repurposing existing group signals for confidence supervision, particularly under input corruption. The absence of new labeling costs and the dual benefit to accuracy and calibration are potential strengths for practical deployment of reliable vision-language agents.

major comments (3)
  1. [§3] §3 (Method), ranking-aware group loss definition: the central claim that this loss 'forces the policy to distinguish between correct and incorrect reasoning paths' and thereby improves both accuracy and calibration rests on the unverified assumption that terminal-correctness-derived rankings within a prompt encode graded reasoning quality beyond binary outcome. No correlation analysis with external visual-reasoning metrics or human judgments is described; if rankings largely reflect the binary reward, the loss reduces to a reweighting of the original correctness signal and cannot be guaranteed to correct miscalibration.
  2. [§4] Experimental evaluation (likely §4 and associated tables): the abstract asserts that the combination 'achieves the best calibration across all tested backbones while improving accuracy in the majority of settings,' yet the provided summary supplies no quantitative values, baseline comparisons, ECE or Brier scores, statistical significance tests, or details on corruption application (e.g., severity levels, number of corruptions per image). Without these, the magnitude and robustness of the claimed gains cannot be assessed and the load-bearing empirical support remains opaque.
  3. [§3.3] Clean-corrupted pairwise loss (likely Eq. in §3.3): the loss is described as enforcing confidence attenuation as visual evidence degrades, but the manuscript does not specify how the corrupted inputs are generated or whether the corruption distribution matches the test-time distribution. If the corruption model introduces spurious correlations rather than controlled degradation, the pairwise supervision could itself induce new miscalibration rather than mitigate it.
minor comments (2)
  1. [§3] Notation for the two losses should be introduced with explicit mathematical definitions early in §3 rather than described only in prose, to allow readers to verify the claimed parameter-free integration with standard group-based RL objectives.
  2. [Discussion] The manuscript should include a dedicated limitations paragraph discussing potential failure modes when ranking signals are noisy or when corruption types differ from those used in training.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [§3] §3 (Method), ranking-aware group loss definition: the central claim that this loss 'forces the policy to distinguish between correct and incorrect reasoning paths' and thereby improves both accuracy and calibration rests on the unverified assumption that terminal-correctness-derived rankings within a prompt encode graded reasoning quality beyond binary outcome. No correlation analysis with external visual-reasoning metrics or human judgments is described; if rankings largely reflect the binary reward, the loss reduces to a reweighting of the original correctness signal and cannot be guaranteed to correct miscalibration.

    Authors: We appreciate the referee's careful reading of the method section. The ranking-aware group loss is indeed based on rankings derived from terminal correctness within groups of rollouts for the same prompt. However, because the policy generates multiple diverse reasoning paths, the relative ranking captures distinctions in reasoning quality that lead to success or failure. Our empirical results demonstrate improvements in both accuracy and calibration, which would not occur if the loss merely reweighted the binary signal. That said, we acknowledge the value of additional validation and will include a correlation analysis between the derived rankings and human judgments of reasoning quality in the revised manuscript. revision: yes

  2. Referee: [§4] Experimental evaluation (likely §4 and associated tables): the abstract asserts that the combination 'achieves the best calibration across all tested backbones while improving accuracy in the majority of settings,' yet the provided summary supplies no quantitative values, baseline comparisons, ECE or Brier scores, statistical significance tests, or details on corruption application (e.g., severity levels, number of corruptions per image). Without these, the magnitude and robustness of the claimed gains cannot be assessed and the load-bearing empirical support remains opaque.

    Authors: The full manuscript contains detailed experimental results in Section 4, including tables reporting accuracy, ECE, and Brier scores for all baselines and our method across the six benchmarks under both clean and corrupted conditions. We compare against standard RL post-training and calibration methods. Corruption details, including the types (e.g., noise, blur, weather) and severity levels (1-5), along with the number of corrupted versions per image, are specified in Section 4.1 and the appendix. We report results averaged over multiple runs but will add explicit statistical significance tests (e.g., paired t-tests) in the revision to further support the claims. revision: partial

  3. Referee: [§3.3] Clean-corrupted pairwise loss (likely Eq. in §3.3): the loss is described as enforcing confidence attenuation as visual evidence degrades, but the manuscript does not specify how the corrupted inputs are generated or whether the corruption distribution matches the test-time distribution. If the corruption model introduces spurious correlations rather than controlled degradation, the pairwise supervision could itself induce new miscalibration rather than mitigate it.

    Authors: We clarify that the corrupted inputs for the pairwise loss are generated using the same corruption functions and severity levels as those used in the test-time evaluation, ensuring the training and test distributions align. The generation process is detailed in the experimental setup section, employing standard image corruptions without introducing task-specific spurious features. To address potential concerns about new miscalibrations, we will add an ablation study examining the effect of different corruption types in the revised version. revision: yes
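
The second response above mentions adding paired t-tests. As a rough sketch of what such a test computes, the snippet below derives the paired t statistic from per-run metric values for two methods evaluated on the same runs. The function name `paired_t_statistic` is hypothetical and the numbers are made-up placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n)), d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-run ECE values for two methods (illustrative only).
ece_baseline = [0.20, 0.22, 0.18, 0.21]
ece_method = [0.15, 0.16, 0.15, 0.14]
t = paired_t_statistic(ece_baseline, ece_method)
print(t)  # positive when the baseline's ECE is higher on average
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` performs both steps when SciPy is available.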

Circularity Check

0 steps flagged

No circularity: RAC losses directly apply existing group-RL ranking signals without redefinition or fitted-input predictions

Full rationale

The paper presents the ranking-aware group loss and clean-corrupted pairwise loss as direct uses of ranking and corruption signals already generated by standard group-based RL post-training, with no equations shown that equate outputs to inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems invoked. The claimed calibration and accuracy gains are framed as empirical results from integrating these losses on Qwen2.5-VL and InternVL-3.5 backbones across six benchmarks, remaining independent of the input signals rather than tautological. This matches the most common honest finding of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that group rollouts already encode reliable quality comparisons and that visual corruption reliably degrades evidence quality in a way that should reduce confidence. No free parameters or new entities are introduced in the abstract description.

axioms (2)
  • domain assumption: Group-based RL rollouts produce ranking signals that correlate with reasoning quality.
    This premise is invoked to justify the ranking-aware group loss without additional correctness labels.
  • domain assumption: Corrupted visual inputs provide measurably weaker evidence than clean inputs for the same prompt.
    This premise is used to define the clean-corrupted pairwise loss.
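
The second assumption can be made concrete with a toy clean/corrupted pair builder. This is a sketch only: it applies additive Gaussian noise to a grayscale image stored as nested lists of values in [0, 1], and the per-severity noise scales in `SIGMA_BY_SEVERITY` are hypothetical values chosen for illustration, not the settings used in the paper or in ImageNet-C.

```python
import random

# Hypothetical noise scale for each severity level 1..5 (not from the paper).
SIGMA_BY_SEVERITY = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}


def add_gaussian_noise(image, severity, seed=0):
    """Return a noisy copy of `image`, with pixel values clipped to [0, 1]."""
    if severity not in SIGMA_BY_SEVERITY:
        raise ValueError("severity must be an integer from 1 to 5")
    rng = random.Random(seed)  # seeded so the pair is reproducible
    sigma = SIGMA_BY_SEVERITY[severity]
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def make_pair(image, severity, seed=0):
    """Return (clean, corrupted) views of the same image."""
    return image, add_gaussian_noise(image, severity, seed)


def mean_abs_change(original, other):
    """Average absolute per-pixel difference between two images."""
    diffs = [abs(a - b) for ra, rb in zip(original, other) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


image = [[0.5] * 4 for _ in range(4)]
clean, corrupted = make_pair(image, severity=3)
# Higher severity moves pixels further from the clean image.
mild = mean_abs_change(image, add_gaussian_noise(image, 1))
heavy = mean_abs_change(image, add_gaussian_noise(image, 5))
print(mild < heavy)  # True
```

Pixel-level distance is only a proxy for evidence quality. Whether a given corruption actually removes task-relevant information depends on the question being asked, which is the part of the assumption that code like this cannot verify.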

pith-pipeline@v0.9.0 · 5778 in / 1324 out tokens · 29623 ms · 2026-05-19T20:49:17.718708+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

  1. [1]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceed- ings of the 38th International Conference on Machine Learning, volume 139, pages 8748–8...

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´nk...

  3. [3]

    Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

  4. [4]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916, 2023

  5. [5]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  6. [6]

    Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution, 2024

  7. [7]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  8. [8]

    Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

  9. [9]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  10. [10]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learn- ing.Nature, 645(8081):633–638, September 2025

    Deepseek AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learn- ing.Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/ s41586-025-09422-z

  11. [11]

    Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, 2025

    Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, 2025

  12. [12]

    Language models (mostly) know what they know, 2022

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

  13. [13]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 54...

  14. [14]

    Chen, Yoonho Lee, Eric Mitchell, and Chelsea Finn

    Johnathan Xie, Annie S. Chen, Yoonho Lee, Eric Mitchell, and Chelsea Finn. Calibrating language models with adaptive temperature scaling. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18128–18138, 2024

  15. [15]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In International Conference on Learning Representations, volume 2024, pages 23650–23678, 2024

  16. [16]

    Calibrating the confidence of large language models by eliciting fidelity

    Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, and Xipeng Qiu. Calibrating the confidence of large language models by eliciting fidelity. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2959–2979, 2024

  17. [17]

    Calibrating verbal uncertainty as a linear feature to reduce hallucinations

    Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda. Calibrating verbal uncertainty as a linear feature to reduce hallucinations. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3769–3793, 2025

  18. [18]

    LACIE: Listener-aware finetuning for calibration in large language models

    Elias Stengel-Eskin, Peter Hase, and Mohit Bansal. LACIE: Listener-aware finetuning for calibration in large language models. InAdvances in Neural Information Processing Systems, volume 37, pages 43080–43106, 2024

  19. [19]

    Beyond binary rewards: Training lms to reason about their uncertainty, 2025

    Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, and Jacob Andreas. Beyond binary rewards: Training lms to reason about their uncertainty, 2025

  20. [20]

    Reasoning models better express their confidence, 2025

    Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, and Minjoon Seo. Reasoning models better express their confidence, 2025

  21. [21]

    Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

    Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

  22. [22]

    Bench- marking corruption robustness of lvlms: A discriminative benchmark and robustness alignment metric, 2025

    Xiangjie Sui, Songyang Li, Hanwei Zhu, Baoliang Chen, Yuming Fang, and Xin Sun. Bench- marking corruption robustness of lvlms: A discriminative benchmark and robustness alignment metric, 2025. 11

  23. [23]

    Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek

    Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. InAdvances in Neural Information Processing Systems, volume 32, 2019

  24. [24]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. InInternational Conference on Learning Representations, 2019

  25. [25]

    Revisiting the calibration of modern neural networks

    Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. InAdvances in Neural Information Processing Systems, volume 34, pages 15682–15694, 2021

  26. [26]

    In: Bouamor, H., Pino, J., Bali, K

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, December 2023. doi: 10.18653/v1/2023.emnlp-main.20

  27. [27]

    Multi-object hallucination in vision language models

    Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Jianing Yang, David Fouhey, Joyce Chai, and Shengyi Qian. Multi-object hallucination in vision language models. InAdvances in Neural Information Processing Systems, volume 37, pages 44393–44418, 2024. doi: 10.52202/ 079017-1409

  28. [28]

    Instinct vs

    Yunkai Dang, Yifan Jiang, Yizhu Jiang, Anqi Chen, Wenbin Li, and Yang Gao. Instinct vs. reflection: Unifying token and verbalized confidence in multimodal large models, 2026

  29. [29]

    Overconfidence and calibration in medical vqa: Empirical findings and hallucination-aware mitigation, 2026

    Ji Young Byun, Young-Jin Park, Jean-Philippe Corbeil, and Asma Ben Abacha. Overconfidence and calibration in medical vqa: Empirical findings and hallucination-aware mitigation, 2026

  30. [30]

    Knowledge-Centric Hallucination Detection

    Ahmadian Arash, Cremer Chris, Gallé Matthias, Fadaee Marzieh, Kreutzer Julia, Pietquin Olivier, Üstün Ahmet, and Hooker Sara. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267....

  31. [31]

    Enhancing the outcome reward-based rl training of mllms with self-consistency sampling

    Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, and Jinguo Zhu. Enhancing the outcome reward-based rl training of mllms with self-consistency sampling. InAdvances in Neural Information Processing Systems, volume 38, pages 132018– 132052, 2025

  32. [32]

    Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026

  33. [33]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. InAdvances in Neural Information Processing Systems, volume 33, pages 18661–18673, 2020

  34. [34]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, March 2025. doi: 10.1145/3689031.3696075

  35. [35]

    M 3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M 3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8199–8221, 2024

  36. [36]

    Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reason- ing

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Songchun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reason- ing. InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2021. 12

  37. [37]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InAdvances in Neural Information Processing Systems, volume 35, pages 2507–2521, 2022

  38. [38]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InComputer Vision – ECCV 2024, pages 169–186, 2025

    Renrui ZZhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InComputer Vision – ECCV 2024, pages 169–186, 2025

  39. [39]

    Emogen: Emotional image content generation with text-to-image diffusion models,

    Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  40. [40]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning? InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2025

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, GongQue, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2025

  41. [41]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 95095–95169, 2024. doi:...

  42. [42]

    Obtaining well-calibrated probabilities using Bayesian binning

    Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well-calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

  43. [43]

    Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

  44. [44]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  45. [45]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  46. [46]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, pages 53728–53741, 2023.

  47. [47]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning, 2025

  48. [48]

    Predicting good probabilities with supervised learning

    Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005. ISBN 1595931805. doi: 10.1145/1102351.1102430

  49. [49]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, 06–11 Aug 2017.

  50. [50]

    Calibration of pre-trained transformers

    Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 295–302, November 2020. doi: 10.18653/v1/2020.emnlp-main.21. URL https://aclanthology.org/2020.emnlp-main.21/.

  51. [51]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR), 2023.

  52. [52]

    Vl-calibration: Decoupled confidence calibration for large vision-language models reasoning

    Wenyi Xiao, Xinchi Xu, and Leilei Gan. Vl-calibration: Decoupled confidence calibration for large vision-language models reasoning, 2026.

Appendices

    A Details of Models and Datasets
    B Chat Template
    C Image Corruptions and Examples
    D Evaluation Metrics
    E Compute Resources
    F Use of LLMs
    G Broader Impacts

A Details of Models and Da...