Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Anas Mahmoud; Bing Liu; Daniel George; Jackson Lee; MohammadHossein Rezaei; Utkarsh Tyagi; Xingang Guo; Yunzhong He

arxiv: 2605.20164 · v1 · pith:BYDZGUXSnew · submitted 2026-05-19 · 💻 cs.AI

Pith reviewed 2026-05-20 04:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords: rubric rewards · reinforcement learning · policy-aware adaptation · GRPO · RLVR · verifiable rewards · multimodal tasks

The pith

Policy-aware adaptation of rubric rewards makes RL training both more effective and several times faster by emphasizing currently informative criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard rubric rewards in reinforcement learning fix each criterion's weight according to human judgment from the start, yet this mixes lasting importance with immediate usefulness as a teaching signal. The paper shows that many high-importance criteria are already mastered or still unreachable while lower-weighted ones often separate the model's recent attempts. POW3R keeps the original human weights and category balance but raises the influence of criteria that currently differ across rollouts, turning the aggregated reward into a clearer optimization target for GRPO. This change produces higher average rubric scores, more responses that satisfy every criterion, and the same final performance level in two-and-a-half to four times fewer steps across multimodal and text-only tasks.
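The distinction between mastered, unreachable, and still-informative criteria can be made concrete with a small sketch. It uses the labels from the Figure 1 caption (dead, saturated, mixed) and assumes that the caption's $v_j$ is the variance of the binary pass/fail verdict across the rollout group, $p_j(1-p_j)$; that reading, the function name `criterion_states`, and the example verdicts are illustrative assumptions, not taken from the paper.

```python
def criterion_states(verdicts):
    """Label each criterion from one prompt's rollout group.

    verdicts[i][j] is 1 if rollout i passes criterion j, else 0.
    Returns one (state, pass_rate, variance) tuple per criterion.
    """
    n_rollouts = len(verdicts)
    n_criteria = len(verdicts[0])
    out = []
    for j in range(n_criteria):
        p = sum(row[j] for row in verdicts) / n_rollouts  # pass rate p_j
        v = p * (1.0 - p)  # ASSUMPTION: v_j is the variance of a binary verdict
        if p == 0.0:
            state = "dead"        # no rollout passes
        elif p == 1.0:
            state = "saturated"   # every rollout passes
        else:
            state = "mixed"       # verdicts differ across the group
        out.append((state, p, v))
    return out


# Four rollouts graded on three criteria (hypothetical verdicts).
group = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
]
states = criterion_states(group)
```

Under this reading only the third criterion has nonzero variance, so it is the only one that can tell the four rollouts apart, regardless of how much weight the rubric assigns to the other two.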

Core claim

POW3R preserves the human-assigned weights and category balance in rubric rewards while using rollout-level contrast to increase the weight of criteria that currently distinguish the policy's outputs, thereby making the GRPO reward more informative without changing the underlying evaluation target.

What carries the argument

POW3R's policy-aware adaptation mechanism, which reweights rubric criteria according to rollout-level contrast while preserving the original human objective.
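A passage quoted further down this page gives the adaptation as g = sqrt(v + ε), ρ = g / ḡ for the criterion's category, and α̂ = clip((1 − λ) + λρ, α_min, α_max). The sketch below implements that update for a single prompt. It assumes that ḡ is the mean of g over criteria sharing a category, and the default values for `lam`, `eps`, `alpha_min`, and `alpha_max` are placeholders chosen for illustration; none of these details are confirmed by the text on this page, and how the multipliers combine with the human weights is not shown here.

```python
import math


def contrast_multipliers(variances, categories, lam=0.5, eps=1e-6,
                         alpha_min=0.5, alpha_max=2.0):
    """Per-criterion multipliers from rollout-level contrast.

    variances[j]  : verdict variance v_j of criterion j in the rollout group
    categories[j] : category label of criterion j
    All hyperparameter defaults are hypothetical.
    """
    g = [math.sqrt(v + eps) for v in variances]

    # ASSUMPTION: g-bar is the mean of g within each category.
    sums, counts = {}, {}
    for gj, c in zip(g, categories):
        sums[c] = sums.get(c, 0.0) + gj
        counts[c] = counts.get(c, 0) + 1

    alphas = []
    for gj, c in zip(g, categories):
        rho = gj / (sums[c] / counts[c])          # contrast relative to category
        alpha = (1.0 - lam) + lam * rho           # interpolate toward 1 as lam -> 0
        alphas.append(min(max(alpha, alpha_min), alpha_max))  # clip
    return alphas


# Category "a": one mixed and one saturated criterion; category "b": two mixed.
alphas = contrast_multipliers([0.25, 0.0, 0.25, 0.25], ["a", "a", "b", "b"])
no_adaptation = contrast_multipliers([0.25, 0.0], ["a", "a"], lam=0.0)
```

With `lam=0.0` every multiplier is exactly 1, which recovers static aggregation; dividing by the category mean means a criterion is only boosted relative to its own category, which is one plausible way to keep pressure from drifting between categories.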

If this is right

Higher mean rubric reward than vanilla GRPO with static aggregation.
Larger fraction of prompts whose responses satisfy every required criterion.
Reaching the same performance plateau in 2.5 to 4 times fewer training steps.
Consistent wins across three base policies and two datasets that include both multimodal and text-only cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contrast-based reweighting could be tested in other multi-criteria reward designs where static weights underperform.
Separating the signal used for training from the rubric used for final scoring may apply to preference-based RL as well.
Checking whether the adaptation reduces reward hacking on high-weight but uninformative criteria would be a direct follow-up experiment.

Load-bearing premise

That rollout-level contrast between policy outputs can be used to reweight criteria without distorting the human-specified rubric objective or introducing optimization instability.

What would settle it

An experiment in which POW3R produces lower strict completion rates or requires more steps than fixed-weight rubric rewards on the same base policies and datasets would show the adaptation provides no benefit.

Figures

Figures reproduced from arXiv: 2605.20164 by Anas Mahmoud, Bing Liu, Daniel George, Jackson Lee, MohammadHossein Rezaei, Utkarsh Tyagi, Xingang Guo, Yunzhong He.

**Figure 1.** Rubric-pressure diagnostic. We track each criterion's training pressure across criterion signal states. A criterion is dead when no rollout passes it ($p_j=0$), saturated when every rollout passes it ($p_j=1$), and mixed when verdicts differ across the rollout group; dead and saturated criteria have $v_j=0$ and give no group-relative advantage signal. (a) Criteria grouped by setting and absolute static weight/poin…

**Figure 2.** Mechanism check. Each gray point is one prompt under one diagnostic run: Qwen3-VL-4B-Instruct or Gemma 3 12B-IT on our multimodal dataset (MM) or HealthBench English (HB). Moving weight off criteria that all rollouts pass or fail separates the rollout rewards more; reward spread is the standard deviation of rollouts before GRPO standardization. The trend holds prompt-by-prompt in every run. The diagnostic…

**Figure 3.** Illustrative tasks from each rubric-RL setting.

**Figure 4.** Main result visual summary. (a) MM test rubric reward and strict completion for each reward construction; lines connect methods trained from the same base policy. (b) Test rubric-reward gain over the corresponding base model on MM and HB. each other because most Writing Style criteria are already passed by the base policy. This is by design: POW3R concentrates pressure where the rollout group exposes learn…

**Figure 5.** Per-category validation reward trajectories (Qwen3-VL-4B, MM).

**Figure 6.** Overall validation reward trajectory (Qwen3-

**Figure 7.** Where POW3R has signal and where it acts

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
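The two headline metrics can be illustrated with a short sketch. Strict completion follows the abstract's definition (the fraction of prompts whose response satisfies every required rubric criterion). The mean rubric reward is computed here as a normalized weighted sum of passed criteria, which is an assumption about the aggregation; the sketch also treats every listed criterion as required and does not handle negative-point criteria. Function names and example data are hypothetical.

```python
def strict_completion(per_prompt_verdicts):
    """Fraction of prompts whose response passes every listed criterion."""
    passed_all = sum(1 for verdicts in per_prompt_verdicts if all(verdicts))
    return passed_all / len(per_prompt_verdicts)


def mean_rubric_reward(per_prompt_verdicts, per_prompt_weights):
    """Average of per-prompt rewards.

    ASSUMPTION: each prompt's reward is the weight-normalized share of
    passed criteria, with all weights positive.
    """
    rewards = []
    for verdicts, weights in zip(per_prompt_verdicts, per_prompt_weights):
        earned = sum(w * v for w, v in zip(weights, verdicts))
        rewards.append(earned / sum(weights))
    return sum(rewards) / len(rewards)


# Four prompts with prompt-specific rubrics (hypothetical data).
verdicts = [[1, 1, 1], [1, 0, 1], [1, 1], [0, 0]]
weights = [[2, 1, 1], [2, 1, 1], [1, 1], [3, 1]]

completion = strict_completion(verdicts)
reward = mean_rubric_reward(verdicts, weights)
```

The second prompt shows why the two metrics can diverge: it earns three quarters of its rubric weight but does not count toward strict completion because one criterion fails.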

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

POW3R adapts rubric criterion weights via rollout contrast to give GRPO a more informative signal while keeping the human objective fixed, and the experiments show consistent gains plus faster convergence.

The main point is that static rubric aggregation often wastes signal because some criteria are already saturated or unreachable for the current policy. POW3R fixes this by reweighting at the criterion level using contrast between the policy's own rollouts, without changing the final human-specified target or category balance. That distinction between what should matter at the end and what can teach the model right now is the actual contribution here.
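A toy calculation shows how static aggregation can waste signal. The Figure 2 caption defines reward spread as the standard deviation of rollout rewards before GRPO standardization; the sketch below computes it for one prompt under two weightings. The normalized weighted-sum reward, the verdicts, and both weight vectors are illustrative assumptions, and the shifted weights are picked by hand rather than produced by POW3R.

```python
from statistics import pstdev


def rubric_reward(verdict_row, weights):
    """ASSUMPTION: reward is the weight-normalized share of passed criteria."""
    return sum(w * v for w, v in zip(weights, verdict_row)) / sum(weights)


def reward_spread(verdicts, weights):
    """Standard deviation of rollout rewards, before any standardization."""
    return pstdev([rubric_reward(row, weights) for row in verdicts])


# Criterion 0 is saturated (all rollouts pass); criterion 1 is mixed.
group = [[1, 0], [1, 1], [1, 0], [1, 1]]

static_weights = [5.0, 1.0]   # most weight on the saturated criterion
shifted_weights = [1.0, 5.0]  # hand-picked: weight moved to the mixed criterion

spread_static = reward_spread(group, static_weights)
spread_shifted = reward_spread(group, shifted_weights)
```

In this example the rollouts differ only on the low-weight criterion, so the static weighting compresses their rewards into a narrow band, while moving weight onto the mixed criterion widens the gap between them.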

Referee Report

1 major / 2 minor

Summary. The manuscript introduces POW3R, a policy-aware rubric reward framework for RLVR that preserves human-assigned weights and category balance as the fixed objective while dynamically adapting per-criterion weights via rollout-level contrast to emphasize currently distinguishing criteria. It argues that static aggregations conflate importance with optimization utility and shows that this leads to suboptimal signals when criteria are saturated or unreachable. Across three base policies on two datasets (multimodal and text-only), POW3R outperforms vanilla GRPO with rubric rewards in 24 of 30 comparisons, improving mean rubric reward and strict completion rate while reaching the same performance plateau in 2.5–4× fewer training steps.

Significance. If the reported gains hold under rigorous statistical scrutiny, the work provides a practical, low-overhead enhancement to rubric-based RL by making rewards more informative for the current policy state without altering the human-specified target. The consistent improvements across diverse settings and the faster convergence are potentially valuable for efficient post-training on multi-criteria tasks.

major comments (1)

[Abstract and experimental results section] The central claim of winning 24 of 30 base-policy/metric comparisons, plus 2.5–4× faster plateauing, is reported without any details on the number of random seeds, variance or standard errors across runs, or statistical significance tests. This absence directly affects the verifiability of the empirical superiority and is load-bearing for the paper's main contribution.

minor comments (2)

[Method section] The precise formula or algorithm for computing rollout-level contrast and reweighting the criteria should be presented with an equation or pseudocode to allow exact reproduction.
[Introduction or evaluation metrics] Ensure the definition of 'strict completion' (fraction of prompts satisfying every rubric criterion) is stated explicitly in the main text, not only in the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and positive assessment of the manuscript. We appreciate the constructive feedback on strengthening the empirical presentation and will address the concern directly in the revision.

Referee: [Abstract and experimental results section] The central claim of winning 24 of 30 base-policy/metric comparisons, plus 2.5–4× faster plateauing, is reported without any details on the number of random seeds, variance or standard errors across runs, or statistical significance tests. This absence directly affects the verifiability of the empirical superiority and is load-bearing for the paper's main contribution.

Authors: We agree that the current version of the manuscript does not include details on the number of random seeds, variance or standard errors, or statistical significance tests in the abstract or experimental results section. This information is important for verifying the reported improvements. In the revised manuscript we will expand the experimental results section to report the number of random seeds used for each base-policy and dataset combination, present mean rubric rewards and strict completion rates with standard errors across runs, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) to support the 24-of-30 comparison wins and the 2.5–4× faster plateauing claims. These additions will be made without changing the underlying experimental protocol or results.

Revision: yes
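As a rough illustration of the kind of check the referee asks for, the sketch below runs an exact two-sided sign test on a win count such as 24 of 30. This is a swapped-in technique, not one the paper or the rebuttal reports using: it treats each comparison as an independent coin flip under the null and ignores ties, which is a strong assumption here because the comparisons share base policies and datasets. It does not replace per-seed variance estimates or the paired tests named in the rebuttal.

```python
from math import comb


def sign_test_p_value(wins, n):
    """Exact two-sided sign test against a 50/50 null, ignoring ties.

    ASSUMPTION: the n comparisons are independent, which is unlikely to
    hold exactly when they share base policies and datasets.
    """
    k = max(wins, n - wins)  # size of the more extreme side
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p_value = sign_test_p_value(24, 30)
p_even_split = sign_test_p_value(15, 30)
```

A per-comparison test over seeds would be more informative, since the sign test says nothing about effect size or run-to-run variance.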

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces POW3R as an empirical policy-aware adaptation layered on top of standard GRPO with rubric rewards. It uses rollout-level contrast to modulate per-criterion emphasis while explicitly preserving the original human-assigned weights and category balance as the evaluation target. No equations or claims reduce a prediction or result to a fitted quantity by construction, and no load-bearing steps rely on self-citations, uniqueness theorems, or ansatzes imported from prior author work. The reported gains in mean rubric reward, strict completion rate, and faster convergence are presented as direct empirical outcomes across three base policies and two datasets, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions for generating multiple rollouts and the validity of contrast as an optimization signal; no new free parameters, axioms, or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Multiple independent rollouts can be sampled from the current policy for each prompt to compute contrast.
Invoked in the description of rollout-level contrast for reweighting criteria.
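The role of the rollout group can be sketched with the group-relative advantage used in GRPO, where each rollout's reward is standardized against the other rollouts for the same prompt. The version below uses the population standard deviation and a small `eps` in the denominator; both are common implementation choices assumed here for illustration rather than details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's rollout group.

    eps guards against division by zero when all rewards are equal
    (ASSUMPTION: a typical stabilizer, not specified on this page).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All rollouts score the same: no rollout is preferred over another.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

# Rollouts differ: higher-reward rollouts get positive advantages.
split = group_relative_advantages([0.0, 1.0, 0.0, 1.0])
```

When every rollout receives the same reward the advantages are all zero, which is why criteria that all rollouts pass or all rollouts fail contribute no learning signal on their own.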

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs... $g_j^{(t)} = \sqrt{v_j^{(t)} + \epsilon}$, $\rho_j^{(t)} = g_j^{(t)} / \bar{g}_{\kappa_j}^{(t)}$, $\hat{\alpha}_j^{(t)} = \operatorname{clip}\big((1-\lambda) + \lambda \rho_j^{(t)},\ \alpha_{\min},\ \alpha_{\max}\big)$

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page 2026

[9] [9]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains, 2025. URL https://arxiv.org/abs/ 2507.17746

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

OpenRubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment, 2025

Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. OpenRubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment, 2025. URL https://arxiv. org/abs/2510.07743

work page arXiv 2025

[11] [11]

Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation.arXiv preprint arXiv:2601.08430, 2026

Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, and Wei Chen. RubricHub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation, 2026. URLhttps://arxiv.org/abs/2601.08430

work page arXiv 2026

[12] [12]

AutoRubric: Rubric- based generative rewards for faithful multimodal reasoning, 2025

Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi. AutoRubric: Rubric- based generative rewards for faithful multimodal reasoning, 2025. URL https://arxiv.org/abs/2510. 14738

work page 2025

[13] [13]

Learning to optimize multi-objective alignment through dynamic reward weighting, 2025

Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, et al. Learning to optimize multi-objective alignment through dynamic reward weighting, 2025. URLhttps://arxiv.org/abs/2509.11452. 12 Scale AI Research

work page arXiv 2025

[14] [14]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report, 2025. URL https://arxiv.org/ abs/2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, et al. Gemma 3 technical report, 2025. URL https: //arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Introducing GPT-5.4 mini and nano

OpenAI. Introducing GPT-5.4 mini and nano. OpenAI product release, 2026. URL https://openai.com/ index/introducing-gpt-5-4-mini-and-nano/

work page 2026

[17] [17]

A Survey of Multi-Objective Sequential Decision-Making

Diederik Marijn Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making, 2014. URLhttps://arxiv.org/abs/1402.0590

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [18]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. GDPO: Group reward-decoupled normalization policy optimization for multi-reward RL optimization, 2026. URL https://arxiv.org/abs/2601.05242

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Audio multichallenge: A multi-turn evaluation of spoken dialogue systems on natural human interaction,

Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, and Yunzhong He. Audio MultiChallenge: A multi-turn evaluation of spoken dialogue systems on natural human interaction, 2025. URLhttps://arxiv.org/abs/2512.14865

work page arXiv 2025

[20] [20]

Online rubrics elicitation from pairwise comparisons, 2025

MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, and Afra Feyza Akyürek. Online rubrics elicitation from pairwise comparisons, 2025. URL https://arxiv. org/abs/2510.07284

work page arXiv 2025

[21] [21]

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. DR Tulu: Reinforcement learning with ev...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

RuCL: Stratified rubric-based curriculum learning for multimodal large language model reasoning, 2026

Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin, Hengyu Chang, Ancheng Xu, Zhihao Yang, Hamid Alinejad-Rokny, Qiang Qu, Bo Zheng, and Min Yang. RuCL: Stratified rubric-based curriculum learning for multimodal large language model reasoning, 2026. URL https://arxiv.org/abs/2602. 21628

work page 2026

[23] [23]

Multi-objective reinforcement learning from ai feedback.arXiv preprint arXiv:2406.07295, 2024

Marcus Williams. Multi-objective reinforcement learning from AI feedback, 2024. URL https://arxiv. org/abs/2406.07295

work page arXiv 2024

[24] [24]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019. URL https://arxiv. org/abs/1909.08593

work page internal anchor Pith review Pith/arXiv arXiv 2019

[25] [25]

Learning to summarize from human feedback

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2020. URL https://arxiv. org/abs/2009.01325

work page internal anchor Pith review Pith/arXiv arXiv 2020

[26] [26]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, et al. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [27]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, et al. Constitutional ai: Harmlessness from ai feedback, 2022. URL https://arxiv.org/ abs/2212.08073. 13 Scale AI Research

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

RLAIF vs

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, et al. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with ai feedback,

work page

[29] [29]

URLhttps://arxiv.org/abs/2309.00267

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

Rewardbench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787, 2024

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, et al. RewardBench: Evaluating reward models for language modeling, 2024. URL https://arxiv.org/abs/2403.13787

work page arXiv 2024

[31] [31]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning, 2025. URLhttps://arxiv.org/abs/2503.01785

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models, 2025. URL https://arxiv.org/abs/2503.06749

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization,

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization,

work page

[34] [34]

URLhttps://arxiv.org/abs/2503.12937

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

Ground-R1: Incentiviz- ing grounded visual reasoning via reinforcement learning, 2025

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-R1: Incentiviz- ing grounded visual reasoning via reinforcement learning, 2025. URL https://arxiv.org/abs/2505. 20272

work page 2025

[36] [36]

Perceptual-evidence anchored reinforced learning for multimodal reasoning, 2025

Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Zhixiong Zeng, Siqi Yang, Peng Shi, Lin Ma, and Jing Zhang. Perceptual-evidence anchored reinforced learning for multimodal reasoning, 2025. URL https: //arxiv.org/abs/2511.18437

work page arXiv 2025

[37] [37]

arXiv preprint arXiv:2510.02240 (2025)

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. RewardMap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning, 2025. URL https: //arxiv.org/abs/2510.02240

work page arXiv 2025

[38] [38]

Bridging perception and reasoning: Token reweighting for RLVR in multimodal LLMs, 2026

Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Guoyin Wang, Jiancan Wu, Xiang Wang, and Xiangnan He. Bridging perception and reasoning: Token reweighting for RLVR in multimodal LLMs, 2026. URLhttps://arxiv.org/abs/2603.25077

work page arXiv 2026

[39] [39]

Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning,

Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Jayeon Park, Ernesto Gabriel Hernandez Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, and Rakshith Sharma Srinivasa. Beyond seeing: Evaluating multimodal LLMs on tool-enabled image perception, transformation, and reasoning, 2025. URL https://arxiv.org/abs/2510.12712

work page arXiv 2025

[40] [40]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/ 2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision an...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

In: Bouamor, H., Pino, J., Bali, K

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallu- cination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.20. URLhttps://aclanthology.org/...

work page doi:10.18653/v1/2023.emnlp-main.20 2023

[44] [44]

arXiv preprint arXiv:2504.07957 , year=

Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. MM-IFEngine: Towards multimodal instruction following, 2025. URL https://arxiv.org/abs/2504.07957

work page arXiv 2025

[45] [45]

MM-Vet v2: A challenging benchmark to evaluate large multimodal models for integrated capabilities, 2024

Weihao Yu, Zhengyuan Yang, Lingfeng Ren, Linjie Li, Jianfeng Wang, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, and Xinchao Wang. MM-Vet v2: A challenging benchmark to evaluate large multimodal models for integrated capabilities, 2024. URLhttps://arxiv.org/abs/2408.00765

work page arXiv 2024

[46] [46]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=KUNzEQMWU7

work page 2024

[47] [47]

RealWorldQA: A benchmark for real-world spatial understanding

xAI. RealWorldQA: A benchmark for real-world spatial understanding. xAI dataset release, 2024. URL https://huggingface.co/datasets/xai-org/RealworldQA

work page 2024

[48] [48]

Generalized Slow Roll for Tensors

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2020. doi: 10.1109/SC41405.2020.00024. URL https://arxiv.org/abs/ 1910.02054

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41405.2020.00024 2020

[49] [49]

the response avoids the prohibited behavior,

OpenAI. Introducing GPT-5.4. OpenAI product release, 2026. URL https://openai.com/index/ introducing-gpt-5-4/. 15 Scale AI Research Appendix Appendix index. • Section A: Dataset and contributor details– rubric annotations, contributor demographics, and split details for MM. • Section B: Judge selection– reference-judge calibration, cost–quality tradeoffs,...

work page 2026