Recognition: unknown
Visual Preference Optimization with Rubric Rewards
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
Rubric-based scoring provides finer-grained preference data for optimizing visual reasoning models than outcome-based methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model.
What carries the argument
Instance-specific rubrics: checklist-style criteria created offline for each image-instruction pair to enable criterion-level scoring of responses.
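To make the load-bearing object concrete, below is a minimal sketch of what an instance-specific rubric and its criterion-level aggregation could look like in code. The field names, the 0.5 weight on additional criteria, and the normalization are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Checklist-style rubric for one image-instruction pair (illustrative shape)."""
    essential: list[str]                                  # must-satisfy criteria
    additional: list[str] = field(default_factory=list)  # nice-to-have criteria


def aggregate(essential_hits: list[bool], additional_hits: list[bool],
              w_additional: float = 0.5) -> float:
    """Fold per-criterion verdicts into one scalar score in [0, 1].

    Down-weighting additional criteria is an assumption; the paper's
    aggregation rule is not reproduced in the text above.
    """
    denom = len(essential_hits) + w_additional * len(additional_hits)
    if denom == 0:
        return 0.0
    num = sum(essential_hits) + w_additional * sum(additional_hits)
    return num / denom


# Hypothetical example for a chart-reading instruction
rubric = Rubric(
    essential=["Identifies the correct bar for 2023", "Reports the value 42%"],
    additional=["States the axis units"],
)
print(aggregate([True, True], [False]))  # 0.8
```

The point of the structure is that downstream steps consume per-criterion verdicts rather than a single outcome label.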
If this is right
- Rubric-based filtering raises the macro average to 82.69 on public downstream benchmarks, while outcome-based filtering drops it to 75.82 from 81.14.
- rDPO achieves 61.01 on a comprehensive benchmark, outperforming the style-constrained baseline at 52.36 and surpassing the base model at 59.48.
- Rubric-based prompting improves a 30B-A3B judge model to approach GPT-5.4 performance on reward modeling benchmarks.
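The judge claim above concerns prompting rather than fine-tuning: the same 30B-A3B model is asked to verify each checklist item instead of emitting one holistic score. A minimal sketch of such a rubric-conditioned judge call follows; the prompt wording, the `call_judge` interface, and the yes/no parsing are hypothetical, not the paper's protocol.

```python
JUDGE_PROMPT = """Instruction: {instruction}
Candidate response: {response}
Criterion: {criterion}
Looking at the image, does the response satisfy this criterion? Answer yes or no."""


def check_criteria(call_judge, image, instruction, response, criteria):
    """call_judge(image, prompt) -> str is an assumed judge-model interface.

    Returns one boolean verdict per rubric criterion (criterion-level feedback)
    instead of a single outcome score for the whole response.
    """
    verdicts = []
    for criterion in criteria:
        prompt = JUDGE_PROMPT.format(instruction=instruction,
                                     response=response, criterion=criterion)
        verdicts.append(call_judge(image, prompt).strip().lower().startswith("yes"))
    return verdicts
```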
Where Pith is reading between the lines
- The rubric approach may reduce dependence on proprietary large models for generating preference data in multimodal training.
- Automating rubric generation could scale the method to additional visual and multimodal tasks without increased manual effort.
- Similar criterion checklists might improve preference optimization in text-only domains by providing more structured feedback.
Load-bearing premise
That instance-specific rubrics can be reliably created offline to capture fine-grained quality differences that matter for visual reasoning, and that rubric-based prompting produces a judge model whose scores generalize to on-policy data construction.
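Read literally, this premise implies a pipeline in which on-policy samples are scored against the offline rubric and only clearly separated pairs are kept. Below is a sketch under that reading, with an assumed margin threshold standing in for the paper's unspecified filtering rule.

```python
def build_preference_pair(image, instruction, rubric_score_fn, responses,
                          min_margin: float = 0.2):
    """Score on-policy responses with the rubric-based judge and keep a
    (chosen, rejected) pair only when the score gap is unambiguous.

    rubric_score_fn(image, instruction, response) -> float in [0, 1] is an
    assumed wrapper around the rubric-conditioned judge sketched earlier.
    """
    scored = sorted(((rubric_score_fn(image, instruction, r), r) for r in responses),
                    reverse=True)
    (best_s, best_r), (worst_s, worst_r) = scored[0], scored[-1]
    if best_s - worst_s < min_margin:
        return None  # rubric-based filtering: discard ambiguous pairs
    return {"image": image, "prompt": instruction,
            "chosen": best_r, "rejected": worst_r}
```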
What would settle it
If models trained with rDPO show no performance gains over outcome-based DPO when evaluated on held-out visual reasoning benchmarks using human judgments instead of rubric scores.
Original abstract
The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
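For orientation, preference pairs of this kind would typically be consumed by the standard DPO objective below, where x is the image-instruction pair, y_w and y_l the chosen and rejected responses, and π_ref the frozen reference policy; whether rDPO modifies this loss is not stated in the abstract.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) \,=\, -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
\,-\,\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
```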
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes rDPO, a Direct Preference Optimization framework for multimodal (visual) tasks that constructs preference data using instance-specific checklist-style rubrics for each image-instruction pair. Rubrics are built offline and reused for on-policy data construction; the method is evaluated on reward-modeling benchmarks (where rubric prompting improves a 30B-A3B judge toward GPT-5.4 performance) and downstream tasks (rubric-based filtering yields macro-average 82.69 vs. 75.82 for outcome-based filtering; rDPO reaches 61.01 on a comprehensive benchmark, beating a style-constrained baseline of 52.36 and the 59.48 base model).
Significance. If the empirical gains are reproducible and attributable to the rubric mechanism rather than unstated supervision or bias, the work would provide a concrete way to inject fine-grained, criterion-level visual-reasoning signal into preference optimization pipelines, addressing a known limitation of coarse outcome or off-policy signals in multimodal DPO.
major comments (3)
- [Abstract / Methods] The central claim that 'rubric-based filtering raises the macro average to 82.69' while 'outcome-based filtering drops it to 75.82' is load-bearing, yet the rubric creation protocol, reuse procedure, and any human-agreement or on-policy validation statistics are not supplied; without these it is impossible to rule out selection bias or implicit stronger-model supervision as the source of the lift.
- [Experiments] The reported downstream numbers (82.69 macro average, 61.01 on the comprehensive benchmark) lack error bars, data-split details, and explicit controls for prompt engineering or compute budget; these omissions make it difficult to assess whether the superiority over the 59.48 base model and the 52.36 style-constrained baseline is robust.
- [Experiments / Judge Evaluation] The statement that rubric prompting 'massively improves a 30B-A3B judge and brings it close to GPT-5.4' on public reward-modeling benchmarks requires correlation statistics between rubric scores and human preferences specifically on the on-policy responses later used for DPO training; absent this, generalization of the judge signal remains unverified.
minor comments (1)
- [Abstract] Notation for the 30B-A3B judge model and the 'rDPO' acronym should be introduced once with a clear definition before repeated use.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions have been made to the manuscript to improve clarity and reporting.
Point-by-point responses
- Referee: [Abstract / Methods] The central claim that 'rubric-based filtering raises the macro average to 82.69' while 'outcome-based filtering drops it to 75.82' is load-bearing, yet the rubric creation protocol, reuse procedure, and any human-agreement or on-policy validation statistics are not supplied; without these it is impossible to rule out selection bias or implicit stronger-model supervision as the source of the lift.
Authors: We agree that the rubric creation and reuse details require further elaboration to substantiate the claims and address potential bias concerns. In the revised manuscript, we expand the Methods section with a dedicated description of the offline rubric creation protocol (a fixed checklist-generation prompt applied to each image-instruction pair), the exact reuse mechanism during on-policy sampling, and internal human agreement statistics on rubric quality. We also explicitly note that rubrics are generated from the input pair alone, without reference to any model outputs, to preclude implicit stronger-model supervision. revision: yes
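For readers who want to picture the 'fixed checklist-generation prompt' the authors describe, a hedged sketch is given below; the wording, the JSON output contract, and the `generate_rubric` helper are illustrative assumptions rather than the authors' actual template.

```python
import json

# Hypothetical fixed prompt for offline rubric creation; the real template is
# not given in the text above, so this is only a plausible shape.
RUBRIC_PROMPT = """You are given an image and an instruction.
Instruction: {instruction}
Write a checklist rubric for judging any answer to this instruction.
Return JSON with two arrays:
  "essential": criteria every correct answer must satisfy,
  "additional": criteria that distinguish a good answer from a merely correct one.
Do not refer to any specific answer; use only the image and the instruction."""


def generate_rubric(call_vlm, image, instruction):
    """call_vlm(image, prompt) -> str is an assumed VLM interface."""
    raw = call_vlm(image, RUBRIC_PROMPT.format(instruction=instruction))
    data = json.loads(raw)
    return data["essential"], data["additional"]
```

Generating the rubric from the input pair alone, as in this sketch, is what the authors cite to rule out implicit stronger-model supervision.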
- Referee: [Experiments] The reported downstream numbers (82.69 macro average, 61.01 on the comprehensive benchmark) lack error bars, data-split details, and explicit controls for prompt engineering or compute budget; these omissions make it difficult to assess whether the superiority over the 59.48 base model and the 52.36 style-constrained baseline is robust.
Authors: We acknowledge these omissions in experimental reporting. The revised manuscript now includes error bars computed across multiple random seeds for the key metrics, specifies the exact data splits drawn from the public benchmarks, and adds ablation controls confirming that the gains persist under fixed prompt templates and matched compute budgets for data generation. These additions demonstrate the robustness of the reported improvements over the base model and style baseline. revision: yes
- Referee: [Experiments / Judge Evaluation] The statement that rubric prompting 'massively improves a 30B-A3B judge and brings it close to GPT-5.4' on public reward-modeling benchmarks requires correlation statistics between rubric scores and human preferences specifically on the on-policy responses later used for DPO training; absent this, generalization of the judge signal remains unverified.
Authors: We agree that direct correlation statistics on the on-policy responses would provide stronger verification of generalization. Our judge evaluation was performed on standard public reward-modeling benchmarks, and the downstream task gains offer indirect support for the rubric signal. In the revision we add a discussion of this point and note the absence of on-policy human correlations as a limitation. revision: partial
- Correlation statistics between rubric scores and human preferences specifically on the on-policy responses used for DPO training were not collected.
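The missing validation would be cheap to run once human A/B labels exist on the on-policy responses. Here is a sketch of the agreement statistics the referee asks for, assuming paired scalar rubric scores and binary human preferences; the function name and inputs are hypothetical.

```python
from scipy.stats import spearmanr


def judge_human_agreement(rubric_scores_a, rubric_scores_b, human_prefers_a):
    """Agreement between rubric-score margins and human A/B preferences.

    rubric_scores_a/b: per-pair judge scores for the two on-policy responses.
    human_prefers_a:   1 if annotators preferred response A, else 0.
    Returns pairwise accuracy and the Spearman correlation of margin vs. label.
    """
    margins = [a - b for a, b in zip(rubric_scores_a, rubric_scores_b)]
    accuracy = sum((m > 0) == bool(p)
                   for m, p in zip(margins, human_prefers_a)) / len(margins)
    rho, pval = spearmanr(margins, human_prefers_a)
    return accuracy, rho, pval
```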
Circularity Check
No circularity: empirical benchmark comparisons only
Full rationale
The paper describes an empirical framework (rubric creation, judge prompting, on-policy data filtering, DPO training) evaluated via public benchmark scores. No equations, derivations, or fitted parameters are presented that reduce to inputs by construction. Claims rest on reported macro averages (82.69 vs 75.82) and scalability results (61.01), which are external measurements rather than self-referential reductions. No self-citation load-bearing steps or ansatz smuggling appear in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: rubrics of essential and additional criteria can be constructed that accurately reflect quality differences in multimodal responses.