
arxiv: 2605.20164 · v1 · pith:BYDZGUXSnew · submitted 2026-05-19 · 💻 cs.AI

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Pith reviewed 2026-05-20 04:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords rubric rewards · reinforcement learning · policy-aware adaptation · GRPO · RLVR · verifiable rewards · multimodal tasks

The pith

Policy-aware adaptation of rubric rewards makes RL training both more effective and several times faster by emphasizing currently informative criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard rubric rewards in reinforcement learning fix each criterion's weight according to human judgment from the start, yet this conflates a criterion's lasting importance with its immediate usefulness as a teaching signal. The paper shows that many high-importance criteria are already mastered or still unreachable, while lower-weighted ones often separate the model's recent attempts. POW3R keeps the original human weights and category balance but raises the influence of criteria whose verdicts currently differ across rollouts, turning the aggregated reward into a clearer optimization target for GRPO. In most of the reported comparisons, this change produces higher average rubric scores and more responses that satisfy every criterion, and it reaches the same final performance level in 2.5 to 4 times fewer steps across multimodal and text-only tasks.

Core claim

POW3R preserves the human-assigned weights and category balance in rubric rewards while using rollout-level contrast to increase the weight of criteria that currently distinguish the policy's outputs, thereby making the GRPO reward more informative without changing the underlying evaluation target.

What carries the argument

POW3R's policy-aware adaptation mechanism, which reweights rubric criteria according to rollout-level contrast while preserving the original human objective.
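
The page does not reproduce the paper's reweighting formula (the referee report below asks for one), so the following is only an illustrative sketch of the idea, not POW3R's actual rule. It assumes binary verdicts, positive human weights, a contrast measure of p(1-p) on each criterion's pass rate, a hypothetical strength parameter `alpha`, and that "category balance" means each category keeps its original total weight. All names are hypothetical.

```python
from collections import defaultdict


def contrast(pass_rate: float) -> float:
    """Bernoulli variance p(1-p); zero when all rollouts agree (assumed measure)."""
    return pass_rate * (1.0 - pass_rate)


def reweight(weights, categories, pass_rates, alpha=1.0):
    """Hypothetical contrast-based reweighting; not the paper's formula.

    Each human weight is scaled by (1 + alpha * normalized contrast), then
    rescaled so every category keeps its original total weight.
    Assumes all weights are positive.
    """
    contrasts = [contrast(p) for p in pass_rates]
    max_c = max(contrasts)
    if max_c == 0.0:
        return list(weights)  # no criterion separates rollouts; keep human weights
    scaled = [w * (1.0 + alpha * c / max_c) for w, c in zip(weights, contrasts)]

    original_total = defaultdict(float)
    scaled_total = defaultdict(float)
    for w, s, cat in zip(weights, scaled, categories):
        original_total[cat] += w
        scaled_total[cat] += s
    return [s * original_total[cat] / scaled_total[cat]
            for s, cat in zip(scaled, categories)]


# Toy rubric (made up): two categories, four criteria.
weights = [5.0, 3.0, 1.0, 1.0]
categories = ["accuracy", "accuracy", "style", "style"]
pass_rates = [1.0, 0.5, 0.0, 0.25]  # saturated, mixed, dead, mixed
adapted = reweight(weights, categories, pass_rates)
```

In this toy case the saturated and dead criteria lose weight to the mixed criteria in the same category, while each category's total is unchanged. The human weights are still the input to the rule, which is one way to read "preserving the original human objective".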

If this is right

  • Higher mean rubric reward than vanilla GRPO with static aggregation.
  • Larger fraction of prompts whose responses satisfy every required criterion.
  • Reaching the same performance plateau in 2.5 to 4 times fewer training steps.
  • Wins in 24 of 30 base-policy/metric comparisons across three base policies and two datasets that include both multimodal and text-only cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrast-based reweighting could be tested in other multi-criteria reward designs where static weights underperform.
  • Separating the signal used for training from the rubric used for final scoring may apply to preference-based RL as well.
  • Checking whether the adaptation reduces reward hacking on high-weight but uninformative criteria would be a direct follow-up experiment.

Load-bearing premise

That rollout-level contrast between policy outputs can be used to reweight criteria without distorting the human-specified rubric objective or introducing optimization instability.

What would settle it

An experiment in which POW3R produces lower strict completion rates or requires more steps than fixed-weight rubric rewards on the same base policies and datasets would show the adaptation provides no benefit.

Figures

Figures reproduced from arXiv: 2605.20164 by Anas Mahmoud, Bing Liu, Daniel George, Jackson Lee, MohammadHossein Rezaei, Utkarsh Tyagi, Xingang Guo, Yunzhong He.

Figure 1: Rubric-pressure diagnostic. We track each criterion's training pressure across criterion signal states. A criterion is dead when no rollout passes it ($p_j=0$), saturated when every rollout passes it ($p_j=1$), and mixed when verdicts differ across the rollout group; dead and saturated criteria have $v_j=0$ and give no group-relative advantage signal. (a) Criteria grouped by setting and absolute static weight/poin…
Figure 2: Mechanism check. Each gray point is one prompt under one diagnostic run: Qwen3-VL-4B-Instruct or Gemma 3 12B-IT on our multimodal dataset (MM) or HealthBench English (HB). Moving weight off criteria that all rollouts pass or fail separates the rollout rewards more; reward spread is the standard deviation of rollouts before GRPO standardization. The trend holds prompt-by-prompt in every run. The diagnostic…
Figure 3: Illustrative tasks from each rubric-RL setting.
Figure 4: Main result visual summary. (a) MM test rubric reward and strict completion for each reward construction; lines connect methods trained from the same base policy. (b) Test rubric-reward gain over the corresponding base model on MM and HB. […] each other because most Writing Style criteria are already passed by the base policy. This is by design: POW3R concentrates pressure where the rollout group exposes learn…
Figure 5: Per-category validation reward trajectories (Qwen3-VL-4B, MM).
Figure 6: Overall validation reward trajectory (Qwen3-
Figure 7: Where POW3R has signal and where it acts
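
The captions of Figures 1 and 2 define the criterion signal states (dead, saturated, mixed) and reward spread (the standard deviation of rollout rewards before GRPO standardization). A minimal sketch of both quantities follows. The verdict matrix is made-up toy data, the use of the population standard deviation is an assumption, and moving all weight onto mixed criteria is an extreme illustrative case rather than the paper's procedure. The function names are hypothetical.

```python
from statistics import pstdev


def signal_state(verdicts):
    """Classify one criterion from its per-rollout pass/fail verdicts."""
    p = sum(verdicts) / len(verdicts)  # pass rate p_j
    if p == 0.0:
        return "dead"
    if p == 1.0:
        return "saturated"
    return "mixed"


def rollout_rewards(verdict_rows, weights):
    """Fraction of total rubric weight earned by each rollout."""
    total = sum(weights)
    return [sum(w for w, v in zip(weights, row) if v) / total
            for row in verdict_rows]


# Toy data (made up): 4 rollouts x 3 criteria, 1 = pass.
verdict_rows = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
]
weights = [5.0, 3.0, 1.0]

columns = list(zip(*verdict_rows))
states = [signal_state(col) for col in columns]

# Extreme illustrative case: move all weight onto mixed criteria, same total.
# The toy data contains a mixed criterion, so mixed_total is nonzero here.
mixed_total = sum(w for w, s in zip(weights, states) if s == "mixed")
scale = sum(weights) / mixed_total
shifted = [w * scale if s == "mixed" else 0.0 for w, s in zip(weights, states)]

spread_static = pstdev(rollout_rewards(verdict_rows, weights))
spread_shifted = pstdev(rollout_rewards(verdict_rows, shifted))
```

Here the two highest-weight criteria are saturated and dead, so they add the same constant to every rollout's reward and contribute nothing to the spread. Shifting weight onto the single mixed criterion widens the spread from 1/18 to 1/2 in this toy example.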
Original abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
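
The two evaluation metrics named in the abstract can be sketched as follows. Strict completion follows the abstract's definition; the normalization of the rubric reward as a weighted pass fraction and the per-criterion `required` flag are assumptions, since the page does not give them. The prompts are made-up toy data and the function names are hypothetical.

```python
def rubric_reward(verdicts, weights):
    """Weighted fraction of rubric weight satisfied (assumed normalization)."""
    return sum(w for w, v in zip(weights, verdicts) if v) / sum(weights)


def strict_complete(verdicts, required):
    """True when the response passes every required criterion."""
    return all(v for v, req in zip(verdicts, required) if req)


def evaluate(prompts):
    """Return (mean rubric reward, strict completion rate) over the prompts."""
    rewards = [rubric_reward(p["verdicts"], p["weights"]) for p in prompts]
    strict = [strict_complete(p["verdicts"], p["required"]) for p in prompts]
    return sum(rewards) / len(prompts), sum(strict) / len(prompts)


# Toy prompts (made up for illustration).
prompts = [
    {"verdicts": [True, True, False], "weights": [2.0, 1.0, 1.0],
     "required": [True, True, False]},
    {"verdicts": [True, False, True], "weights": [2.0, 1.0, 1.0],
     "required": [True, True, False]},
]
mean_reward, strict_rate = evaluate(prompts)
```

Both toy responses earn the same rubric reward (0.75), but only the first passes every required criterion, so strict completion is 0.5. The example shows why the two metrics can move independently.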

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces POW3R, a policy-aware rubric reward framework for RLVR that preserves human-assigned weights and category balance as the fixed objective while dynamically adapting per-criterion weights via rollout-level contrast to emphasize currently distinguishing criteria. It argues that static aggregations conflate importance with optimization utility and shows that this leads to suboptimal signals when criteria are saturated or unreachable. Across three base policies on two datasets (multimodal and text-only), POW3R outperforms vanilla GRPO with rubric rewards in 24 of 30 comparisons, improving mean rubric reward and strict completion rate while reaching the same performance plateau in 2.5 to 4× fewer training steps.

Significance. If the reported gains hold under rigorous statistical scrutiny, the work provides a practical, low-overhead enhancement to rubric-based RL by making rewards more informative for the current policy state without altering the human-specified target. The consistent improvements across diverse settings and the faster convergence are potentially valuable for efficient post-training on multi-criteria tasks.

major comments (1)
  1. [Abstract and experimental results section] The central claim of winning 24 of 30 base-policy/metric comparisons, plus 2.5 to 4× faster plateauing, is reported without any details on the number of random seeds, variance or standard errors across runs, or statistical significance tests. This absence directly affects the verifiability of the empirical superiority and is load-bearing for the paper's main contribution.
minor comments (2)
  1. [Method section] The precise formula or algorithm for computing rollout-level contrast and reweighting the criteria should be presented with an equation or pseudocode to allow exact reproduction.
  2. [Introduction or evaluation metrics] Ensure the definition of 'strict completion' (fraction of prompts satisfying every rubric criterion) is stated explicitly in the main text, not only in the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and positive assessment of the manuscript. We appreciate the constructive feedback on strengthening the empirical presentation and will address the concern directly in the revision.

read point-by-point responses
  1. Referee: [Abstract and experimental results section] The central claim of winning 24 of 30 base-policy/metric comparisons, plus 2.5 to 4× faster plateauing, is reported without any details on the number of random seeds, variance or standard errors across runs, or statistical significance tests. This absence directly affects the verifiability of the empirical superiority and is load-bearing for the paper's main contribution.

    Authors: We agree that the current version of the manuscript does not include details on the number of random seeds, variance or standard errors, or statistical significance tests in the abstract or experimental results section. This information is important for verifying the reported improvements. In the revised manuscript we will expand the experimental results section to report the number of random seeds used for each base-policy and dataset combination, present mean rubric rewards and strict completion rates with standard errors across runs, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) to support the 24-of-30 comparison wins and the 2.5 to 4× faster plateauing claims. These additions will be made without changing the underlying experimental protocol or results.

    Revision: yes
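
As a rough illustration of the kind of check the referee asks for, the 24-of-30 win count can be put through a one-sided sign test using only the numbers reported above. This is a sketch, not an analysis from the paper: it assumes the 30 comparisons are independent and have no ties, which is unlikely to hold exactly because they share base policies and datasets, and it says nothing about variance across seeds. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, trials: int) -> float:
    """One-sided P(X >= wins) for X ~ Binomial(trials, 0.5)."""
    tail = sum(comb(trials, k) for k in range(wins, trials + 1))
    return tail / 2 ** trials


# Win count taken from the abstract: 24 of 30 comparisons.
p_value = sign_test_p_value(24, 30)
```

Under these assumptions the tail probability is 768212 / 2^30, about 0.0007. The seed-level paired tests promised in the rebuttal would still be needed to support the per-comparison and convergence-speed claims.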

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces POW3R as an empirical policy-aware adaptation layered on top of standard GRPO with rubric rewards. It uses rollout-level contrast to modulate per-criterion emphasis while explicitly preserving the original human-assigned weights and category balance as the evaluation target. No equations or claims reduce a prediction or result to a fitted quantity by construction, and no load-bearing steps rely on self-citations, uniqueness theorems, or ansatzes imported from prior author work. The reported gains in mean rubric reward, strict completion rate, and faster convergence are presented as direct empirical outcomes across three base policies and two datasets, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions for generating multiple rollouts and the validity of contrast as an optimization signal; no new free parameters or invented entities are explicitly introduced in the abstract, and the single axiom listed is a domain assumption.

axioms (1)
  • domain assumption: Multiple independent rollouts can be sampled from the current policy for each prompt to compute contrast.
    Invoked in the description of rollout-level contrast for reweighting criteria

pith-pipeline@v0.9.0 · 5808 in / 1219 out tokens · 40833 ms · 2026-05-20T04:59:03.594473+00:00 · methodology


