Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning
Pith reviewed 2026-05-19 20:49 UTC · model grok-4.3
The pith
Ranking signals from group-based RL can supervise confidence to improve calibration in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ranking-Aware Calibration supervises confidence using ranking signals that group-based RL already produces at no additional cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training.
What carries the argument
Ranking-Aware Calibration (RAC), which applies a ranking-aware group loss and a clean-corrupted pairwise loss to supervise confidence using comparison signals already generated during group-based reinforcement learning.
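A minimal Python sketch of what a ranking-aware group loss could look like, assuming a pairwise logistic form over the rollouts of one prompt. This page does not give the paper's exact formula, so the function name `ranking_group_loss`, the logistic penalty, and the per-pair averaging are illustrative assumptions rather than the authors' definition.

```python
import math
from typing import Sequence


def ranking_group_loss(confidences: Sequence[float], rewards: Sequence[float]) -> float:
    """Pairwise logistic ranking loss over one prompt's rollout group (hypothetical form).

    For every ordered pair (i, j) with rewards[i] > rewards[j], add
    log(1 + exp(-(c_i - c_j))), which shrinks as the better rollout's
    confidence rises above the worse rollout's. Tied pairs are skipped.
    """
    if len(confidences) != len(rewards):
        raise ValueError("confidences and rewards must have the same length")
    total, n_pairs = 0.0, 0
    for i, r_i in enumerate(rewards):
        for j, r_j in enumerate(rewards):
            if r_i > r_j:
                margin = confidences[i] - confidences[j]
                total += math.log1p(math.exp(-margin))
                n_pairs += 1
    # A group with no comparable pairs (all rewards tied) contributes nothing.
    return total / n_pairs if n_pairs else 0.0
```

In this sketch, binary correctness rewards yield only correct-versus-incorrect pairs, so a group whose rollouts are all correct or all incorrect contributes zero loss.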
If this is right
- The policy learns to discriminate between correct and incorrect reasoning paths, improving task accuracy beyond standard correctness rewards.
- Calibration error decreases under degraded or corrupted visual inputs.
- The combination of both losses achieves the best calibration across tested backbones while improving accuracy in most settings.
- No external confidence annotations are needed because the signals come directly from the existing RL process.
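The claim about degraded inputs rests on the clean-corrupted pairwise loss. As an illustration only, such a loss could take a hinge form like the sketch below. The function name `clean_corrupted_loss`, the hinge shape, and the default `margin` of 0.1 are assumptions made here; the paper's actual formulation is not shown on this page.

```python
from typing import Sequence


def clean_corrupted_loss(
    conf_clean: Sequence[float],
    conf_corrupted: Sequence[float],
    margin: float = 0.1,  # hypothetical default, not taken from the paper
) -> float:
    """Mean hinge penalty max(0, margin - (c_clean - c_corrupted)).

    The penalty is zero once confidence on the clean image exceeds
    confidence on its corrupted counterpart by at least `margin`.
    """
    if len(conf_clean) != len(conf_corrupted):
        raise ValueError("clean and corrupted confidences must be paired")
    if not conf_clean:
        return 0.0
    penalties = [
        max(0.0, margin - (c - d)) for c, d in zip(conf_clean, conf_corrupted)
    ]
    return sum(penalties) / len(penalties)
```

Each pair here is the same prompt answered once on the clean image and once on its corrupted version, so no confidence labels are needed beyond knowing which input was degraded.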
Where Pith is reading between the lines
- The same ranking-based supervision approach could be tested on language-only models to see if it improves calibration without visual components.
- This method might reduce reliance on separate post-hoc calibration techniques in production RL pipelines.
- Extending the pairwise loss to other forms of input degradation beyond visual corruption could reveal broader robustness benefits.
Load-bearing premise
The ranking signals already produced by group-based RL directly reflect reasoning quality and can be used to supervise confidence without introducing new biases or requiring external validation.
What would settle it
The central claim would be falsified by a controlled experiment on the same benchmarks in which adding the ranking-aware and pairwise losses yields no reduction in calibration error, or no accuracy gain, relative to standard group-based RL training.
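One way to frame such a comparison is a paired test over per-benchmark scores. The sketch below computes a paired t statistic with the Python standard library. The function name `paired_t_statistic` and the choice to pair by benchmark are assumptions made here, and no numbers from the paper are used.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    baseline: Sequence[float], treated: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for the paired differences treated - baseline."""
    if len(baseline) != len(treated):
        raise ValueError("baseline and treated must be paired one-to-one")
    if len(baseline) < 2:
        raise ValueError("at least two pairs are required")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t_value = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1
```

For calibration error, a clearly negative t (treated minus baseline) would indicate a reduction. Converting t to a p-value requires a Student-t table or a statistics library, which this sketch leaves out.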
Original abstract
Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training. We instantiate RAC on Qwen2.5-VL and InternVL-3.5 backbones and evaluate on six multimodal reasoning benchmarks under clean and corrupted inputs. Empirical results show that the ranking-aware loss substantially improves task accuracy by teaching the policy to discriminate between better and worse reasoning, while the pairwise corruption loss reduces calibration error under degraded inputs. Their combination achieves the best calibration across all tested backbones while improving accuracy in the majority of settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Ranking-Aware Calibration (RAC) as a training-time framework for post-training vision-language models with group-based RL. It introduces a ranking-aware group loss that requires higher confidence for better-ranked rollouts within the same prompt and a clean-corrupted pairwise loss that attenuates confidence under degraded visual inputs. Both losses leverage signals already produced by group-based RL without external confidence annotations. The method is instantiated on Qwen2.5-VL and InternVL-3.5 and evaluated on six multimodal reasoning benchmarks under clean and corrupted conditions, with the claim that the combination yields the best calibration while improving accuracy in most settings.
Significance. If the empirical gains hold under rigorous controls, the work could meaningfully advance calibration in multimodal RL systems by repurposing existing group signals for confidence supervision, particularly under input corruption. The absence of new labeling costs and the dual benefit to accuracy and calibration are potential strengths for practical deployment of reliable vision-language agents.
major comments (3)
- [§3] §3 (Method), ranking-aware group loss definition: the central claim that this loss 'forces the policy to distinguish between correct and incorrect reasoning paths' and thereby improves both accuracy and calibration rests on the unverified assumption that terminal-correctness-derived rankings within a prompt encode graded reasoning quality beyond binary outcome. No correlation analysis with external visual-reasoning metrics or human judgments is described; if rankings largely reflect the binary reward, the loss reduces to a reweighting of the original correctness signal and cannot be guaranteed to correct miscalibration.
- [§4] Experimental evaluation (likely §4 and associated tables): the abstract asserts that the combination 'achieves the best calibration across all tested backbones while improving accuracy in the majority of settings,' yet the provided summary supplies no quantitative values, baseline comparisons, ECE or Brier scores, statistical significance tests, or details on corruption application (e.g., severity levels, number of corruptions per image). Without these, the magnitude and robustness of the claimed gains cannot be assessed and the load-bearing empirical support remains opaque.
- [§3.3] Clean-corrupted pairwise loss (likely Eq. in §3.3): the loss is described as enforcing confidence attenuation as visual evidence degrades, but the manuscript does not specify how the corrupted inputs are generated or whether the corruption distribution matches the test-time distribution. If the corruption model introduces spurious correlations rather than controlled degradation, the pairwise supervision could itself induce new miscalibration rather than mitigate it.
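For reference, the ECE and Brier scores requested in the second comment are commonly computed as in the sketch below. It assumes scalar confidences in [0, 1], 0/1 correctness labels, and equal-width bins with a default of 10. The paper's own binning choices are not stated on this page, so these defaults are assumptions.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin instead of overflowing.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Reporting both is useful because ECE depends on the binning scheme while the Brier score does not, so the two can disagree on small evaluation sets.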
minor comments (2)
- [§3] Notation for the two losses should be introduced with explicit mathematical definitions early in §3 rather than described only in prose, to allow readers to verify the claimed parameter-free integration with standard group-based RL objectives.
- [Discussion] The manuscript should include a dedicated limitations paragraph discussing potential failure modes when ranking signals are noisy or when corruption types differ from those used in training.
Simulated Authors' Rebuttal
We are grateful to the referee for the constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.
Referee: [§3] §3 (Method), ranking-aware group loss definition: the central claim that this loss 'forces the policy to distinguish between correct and incorrect reasoning paths' and thereby improves both accuracy and calibration rests on the unverified assumption that terminal-correctness-derived rankings within a prompt encode graded reasoning quality beyond binary outcome. No correlation analysis with external visual-reasoning metrics or human judgments is described; if rankings largely reflect the binary reward, the loss reduces to a reweighting of the original correctness signal and cannot be guaranteed to correct miscalibration.
Authors: We appreciate the referee's careful reading of the method section. The ranking-aware group loss is indeed based on rankings derived from terminal correctness within groups of rollouts for the same prompt. However, because the policy generates multiple diverse reasoning paths, the relative ranking captures distinctions in reasoning quality that lead to success or failure. Our empirical results demonstrate improvements in both accuracy and calibration, which would not occur if the loss merely reweighted the binary signal. That said, we acknowledge the value of additional validation and will include a correlation analysis between the derived rankings and human judgments of reasoning quality in the revised manuscript. revision: yes
Referee: [§4] Experimental evaluation (likely §4 and associated tables): the abstract asserts that the combination 'achieves the best calibration across all tested backbones while improving accuracy in the majority of settings,' yet the provided summary supplies no quantitative values, baseline comparisons, ECE or Brier scores, statistical significance tests, or details on corruption application (e.g., severity levels, number of corruptions per image). Without these, the magnitude and robustness of the claimed gains cannot be assessed and the load-bearing empirical support remains opaque.
Authors: The full manuscript contains detailed experimental results in Section 4, including tables reporting accuracy, ECE, and Brier scores for all baselines and our method across the six benchmarks under both clean and corrupted conditions. We compare against standard RL post-training and calibration methods. Corruption details, including the types (e.g., noise, blur, weather) and severity levels (1-5), along with the number of corrupted versions per image, are specified in Section 4.1 and the appendix. We report results averaged over multiple runs but will add explicit statistical significance tests (e.g., paired t-tests) in the revision to further support the claims. revision: partial
Referee: [§3.3] Clean-corrupted pairwise loss (likely Eq. in §3.3): the loss is described as enforcing confidence attenuation as visual evidence degrades, but the manuscript does not specify how the corrupted inputs are generated or whether the corruption distribution matches the test-time distribution. If the corruption model introduces spurious correlations rather than controlled degradation, the pairwise supervision could itself induce new miscalibration rather than mitigate it.
Authors: We clarify that the corrupted inputs for the pairwise loss are generated using the same corruption functions and severity levels as those used in the test-time evaluation, ensuring the training and test distributions align. The generation process is detailed in the experimental setup section, employing standard image corruptions without introducing task-specific spurious features. To address potential concerns about new miscalibrations, we will add an ablation study examining the effect of different corruption types in the revised version. revision: yes
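The responses above refer to standard image corruptions applied at severity levels 1-5. A rough sketch of one such corruption, additive Gaussian noise, is given below for a grayscale image stored as nested lists. The function name `gaussian_noise` and the sigma values in `SIGMA_BY_SEVERITY` are illustrative placeholders, not the settings used in the paper.

```python
import random
from typing import Dict, List

# Hypothetical noise scale per severity level (pixel values in [0, 1]).
SIGMA_BY_SEVERITY: Dict[int, float] = {1: 0.05, 2: 0.10, 3: 0.15, 4: 0.20, 5: 0.25}


def gaussian_noise(
    image: List[List[float]], severity: int, seed: int = 0
) -> List[List[float]]:
    """Return a noisy copy of a grayscale image with pixel values in [0, 1]."""
    if severity not in SIGMA_BY_SEVERITY:
        raise ValueError("severity must be an integer from 1 to 5")
    rng = random.Random(seed)  # fixed seed so clean/corrupted pairs are reproducible
    sigma = SIGMA_BY_SEVERITY[severity]
    # Add zero-mean noise to each pixel, then clip back into the valid range.
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]
```

Applying the same function at training and evaluation time is what would keep the two corruption distributions aligned, which is the property the third response relies on.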
Circularity Check
No circularity: RAC losses directly apply existing group-RL ranking signals without redefinition or fitted-input predictions
full rationale
The paper presents the ranking-aware group loss and clean-corrupted pairwise loss as direct uses of ranking and corruption signals already generated by standard group-based RL post-training, with no equations shown that equate outputs to inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems invoked. The claimed calibration and accuracy gains are framed as empirical results from integrating these losses on Qwen2.5-VL and InternVL-3.5 backbones across six benchmarks, remaining independent of the input signals rather than tautological. This matches the most common honest finding of a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Group-based RL rollouts produce ranking signals that correlate with reasoning quality.
- domain assumption: Corrupted visual inputs provide measurably weaker evidence than clean inputs for the same prompt.