Recognition: unknown
Visual Preference Optimization with Rubric Rewards
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
Rubric-based scoring provides finer-grained preference data for optimizing visual reasoning models than outcome-based methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model.
What carries the argument
Instance-specific rubrics: checklist-style criteria created offline for each image-instruction pair to enable criterion-level scoring of responses.
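To make the load-bearing object concrete, below is a minimal sketch of what an instance-specific rubric and its criterion-level aggregation could look like in code. The field names, the 0.5 weight on additional criteria, and the normalization are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Checklist-style rubric for one image-instruction pair (illustrative shape)."""
    essential: list[str]                                  # must-satisfy criteria
    additional: list[str] = field(default_factory=list)  # nice-to-have criteria


def aggregate(essential_hits: list[bool], additional_hits: list[bool],
              w_additional: float = 0.5) -> float:
    """Fold per-criterion verdicts into one scalar score in [0, 1].

    Down-weighting additional criteria is an assumption; the paper's
    aggregation rule is not reproduced in the text above.
    """
    denom = len(essential_hits) + w_additional * len(additional_hits)
    if denom == 0:
        return 0.0
    num = sum(essential_hits) + w_additional * sum(additional_hits)
    return num / denom


# Hypothetical example for a chart-reading instruction
rubric = Rubric(
    essential=["Identifies the correct bar for 2023", "Reports the value 42%"],
    additional=["States the axis units"],
)
print(aggregate([True, True], [False]))  # 0.8
```

The point of the structure is that downstream steps consume per-criterion verdicts rather than a single outcome label.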
If this is right
- Rubric-based filtering raises the macro average to 82.69 on public downstream benchmarks, while outcome-based filtering drops it to 75.82 from 81.14.
- rDPO achieves 61.01 on a comprehensive benchmark, outperforming the style-constrained baseline at 52.36 and surpassing the base model at 59.48.
- Rubric-based prompting improves a 30B-A3B judge model to approach GPT-5.4 performance on reward modeling benchmarks.
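The judge claim above concerns prompting rather than fine-tuning: the same 30B-A3B model is asked to verify each checklist item instead of emitting one holistic score. A minimal sketch of such a rubric-conditioned judge call follows; the prompt wording, the `call_judge` interface, and the yes/no parsing are hypothetical, not the paper's protocol.

```python
JUDGE_PROMPT = """Instruction: {instruction}
Candidate response: {response}
Criterion: {criterion}
Looking at the image, does the response satisfy this criterion? Answer yes or no."""


def check_criteria(call_judge, image, instruction, response, criteria):
    """call_judge(image, prompt) -> str is an assumed judge-model interface.

    Returns one boolean verdict per rubric criterion (criterion-level feedback)
    instead of a single outcome score for the whole response.
    """
    verdicts = []
    for criterion in criteria:
        prompt = JUDGE_PROMPT.format(instruction=instruction,
                                     response=response, criterion=criterion)
        verdicts.append(call_judge(image, prompt).strip().lower().startswith("yes"))
    return verdicts
```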
Where Pith is reading between the lines
- The rubric approach may reduce dependence on proprietary large models for generating preference data in multimodal training.
- Automating rubric generation could scale the method to additional visual and multimodal tasks without increased manual effort.
- Similar criterion checklists might improve preference optimization in text-only domains by providing more structured feedback.
Load-bearing premise
That instance-specific rubrics can be reliably created offline to capture fine-grained quality differences that matter for visual reasoning, and that rubric-based prompting produces a judge model whose scores generalize to on-policy data construction.
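Read literally, this premise implies a pipeline in which on-policy samples are scored against the offline rubric and only clearly separated pairs are kept. Below is a sketch under that reading, with an assumed margin threshold standing in for the paper's unspecified filtering rule.

```python
def build_preference_pair(image, instruction, rubric_score_fn, responses,
                          min_margin: float = 0.2):
    """Score on-policy responses with the rubric-based judge and keep a
    (chosen, rejected) pair only when the score gap is unambiguous.

    rubric_score_fn(image, instruction, response) -> float in [0, 1] is an
    assumed wrapper around the rubric-conditioned judge sketched earlier.
    """
    scored = sorted(((rubric_score_fn(image, instruction, r), r) for r in responses),
                    reverse=True)
    (best_s, best_r), (worst_s, worst_r) = scored[0], scored[-1]
    if best_s - worst_s < min_margin:
        return None  # rubric-based filtering: discard ambiguous pairs
    return {"image": image, "prompt": instruction,
            "chosen": best_r, "rejected": worst_r}
```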
What would settle it
If models trained with rDPO show no performance gains over outcome-based DPO when evaluated on held-out visual reasoning benchmarks using human judgments instead of rubric scores.
Original abstract
The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
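For orientation, preference pairs of this kind would typically be consumed by the standard DPO objective below, where x is the image-instruction pair, y_w and y_l the chosen and rejected responses, and π_ref the frozen reference policy; whether rDPO modifies this loss is not stated in the abstract.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) \,=\, -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
\,-\,\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
```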
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes rDPO, a Direct Preference Optimization framework for multimodal (visual) tasks that constructs preference data using instance-specific checklist-style rubrics for each image-instruction pair. Rubrics are built offline and reused for on-policy data construction; the method is evaluated on reward-modeling benchmarks (where rubric prompting improves a 30B-A3B judge toward GPT-5.4 performance) and downstream tasks (rubric-based filtering yields macro-average 82.69 vs. 75.82 for outcome-based filtering; rDPO reaches 61.01 on a comprehensive benchmark, beating a style-constrained baseline of 52.36 and the 59.48 base model).
Significance. If the empirical gains are reproducible and attributable to the rubric mechanism rather than unstated supervision or bias, the work would provide a concrete way to inject fine-grained, criterion-level visual-reasoning signal into preference optimization pipelines, addressing a known limitation of coarse outcome or off-policy signals in multimodal DPO.
major comments (3)
- [Abstract / Methods] The central claim that 'rubric-based filtering raises the macro average to 82.69' while 'outcome-based filtering drops it to 75.82' is load-bearing, yet the rubric creation protocol, reuse procedure, and any human-agreement or on-policy validation statistics are not supplied; without these it is impossible to rule out selection bias or implicit stronger-model supervision as the source of the lift.
- [Experiments] The reported downstream numbers (82.69 macro average, 61.01 on the comprehensive benchmark) lack error bars, data-split details, and explicit controls for prompt engineering or compute budget; these omissions make it difficult to assess whether the superiority over the 59.48 base model and the 52.36 style-constrained baseline is robust.
- [Experiments / Judge Evaluation] The statement that rubric prompting 'massively improves a 30B-A3B judge and brings it close to GPT-5.4' on public reward-modeling benchmarks requires correlation statistics between rubric scores and human preferences specifically on the on-policy responses later used for DPO training; absent this, generalization of the judge signal remains unverified.
minor comments (1)
- [Abstract] Notation for the 30B-A3B judge model and the 'rDPO' acronym should be introduced once with a clear definition before repeated use.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions have been made to the manuscript to improve clarity and reporting.
Point-by-point responses
- Referee: [Abstract / Methods] The central claim that 'rubric-based filtering raises the macro average to 82.69' while 'outcome-based filtering drops it to 75.82' is load-bearing, yet the rubric creation protocol, reuse procedure, and any human-agreement or on-policy validation statistics are not supplied; without these it is impossible to rule out selection bias or implicit stronger-model supervision as the source of the lift.
Authors: We agree that the rubric creation and reuse details require further elaboration to substantiate the claims and address potential bias concerns. In the revised manuscript, we expand the Methods section with a dedicated description of the offline rubric creation protocol (a fixed checklist-generation prompt applied to each image-instruction pair), the exact reuse mechanism during on-policy sampling, and internal human agreement statistics on rubric quality. We also explicitly note that rubrics are generated from the input pair alone, without reference to any model outputs, to preclude implicit stronger-model supervision. revision: yes
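For readers who want to picture the 'fixed checklist-generation prompt' the authors describe, a hedged sketch is given below; the wording, the JSON output contract, and the `generate_rubric` helper are illustrative assumptions rather than the authors' actual template.

```python
import json

# Hypothetical fixed prompt for offline rubric creation; the real template is
# not given in the text above, so this is only a plausible shape.
RUBRIC_PROMPT = """You are given an image and an instruction.
Instruction: {instruction}
Write a checklist rubric for judging any answer to this instruction.
Return JSON with two arrays:
  "essential": criteria every correct answer must satisfy,
  "additional": criteria that distinguish a good answer from a merely correct one.
Do not refer to any specific answer; use only the image and the instruction."""


def generate_rubric(call_vlm, image, instruction):
    """call_vlm(image, prompt) -> str is an assumed VLM interface."""
    raw = call_vlm(image, RUBRIC_PROMPT.format(instruction=instruction))
    data = json.loads(raw)
    return data["essential"], data["additional"]
```

Generating the rubric from the input pair alone, as in this sketch, is what the authors cite to rule out implicit stronger-model supervision.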
- Referee: [Experiments] The reported downstream numbers (82.69 macro average, 61.01 on the comprehensive benchmark) lack error bars, data-split details, and explicit controls for prompt engineering or compute budget; these omissions make it difficult to assess whether the superiority over the 59.48 base model and the 52.36 style-constrained baseline is robust.
Authors: We acknowledge these omissions in experimental reporting. The revised manuscript now includes error bars computed across multiple random seeds for the key metrics, specifies the exact data splits drawn from the public benchmarks, and adds ablation controls confirming that the gains persist under fixed prompt templates and matched compute budgets for data generation. These additions demonstrate the robustness of the reported improvements over the base model and style baseline. revision: yes
- Referee: [Experiments / Judge Evaluation] The statement that rubric prompting 'massively improves a 30B-A3B judge and brings it close to GPT-5.4' on public reward-modeling benchmarks requires correlation statistics between rubric scores and human preferences specifically on the on-policy responses later used for DPO training; absent this, generalization of the judge signal remains unverified.
Authors: We agree that direct correlation statistics on the on-policy responses would provide stronger verification of generalization. Our judge evaluation was performed on standard public reward-modeling benchmarks, and the downstream task gains offer indirect support for the rubric signal. In the revision we add a discussion of this point and note the absence of on-policy human correlations as a limitation. revision: partial
- Correlation statistics between rubric scores and human preferences specifically on the on-policy responses used for DPO training were not collected.
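The missing validation would be cheap to run once human A/B labels exist on the on-policy responses. Here is a sketch of the agreement statistics the referee asks for, assuming paired scalar rubric scores and binary human preferences; the function name and inputs are hypothetical.

```python
from scipy.stats import spearmanr


def judge_human_agreement(rubric_scores_a, rubric_scores_b, human_prefers_a):
    """Agreement between rubric-score margins and human A/B preferences.

    rubric_scores_a/b: per-pair judge scores for the two on-policy responses.
    human_prefers_a:   1 if annotators preferred response A, else 0.
    Returns pairwise accuracy and the Spearman correlation of margin vs. label.
    """
    margins = [a - b for a, b in zip(rubric_scores_a, rubric_scores_b)]
    accuracy = sum((m > 0) == bool(p)
                   for m, p in zip(margins, human_prefers_a)) / len(margins)
    rho, pval = spearmanr(margins, human_prefers_a)
    return accuracy, rho, pval
```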
Circularity Check
No circularity: empirical benchmark comparisons only
Full rationale
The paper describes an empirical framework (rubric creation, judge prompting, on-policy data filtering, DPO training) evaluated via public benchmark scores. No equations, derivations, or fitted parameters are presented that reduce to inputs by construction. Claims rest on reported macro averages (82.69 vs 75.82) and scalability results (61.01), which are external measurements rather than self-referential reductions. No self-citation load-bearing steps or ansatz smuggling appear in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: rubrics of essential and additional criteria can be constructed that accurately reflect quality differences in multimodal responses.