Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
Pith reviewed 2026-05-20 04:59 UTC · model grok-4.3
The pith
Policy-aware adaptation of rubric rewards makes RL training both more effective and several times faster by emphasizing currently informative criteria.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POW3R preserves the human-assigned weights and category balance in rubric rewards while using rollout-level contrast to increase the weight of criteria that currently distinguish the policy's outputs, thereby making the GRPO reward more informative without changing the underlying evaluation target.
What carries the argument
POW3R's policy-aware adaptation mechanism, which reweights rubric criteria according to rollout-level contrast while preserving the original human objective.
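A minimal sketch of the reweighting step, built from the formula fragment quoted later on this page (g = sqrt(v + ε), ρ = g divided by a category mean of g, multiplier = clip((1 − λ) + λρ, α_min, α_max)). Several readings here are assumptions rather than statements from the paper: that v is the per-criterion variance of scores across rollouts for one prompt, that the bar denotes a mean within the criterion's category, and every default value. The function name `adaptive_multipliers` is hypothetical.

```python
import math
from collections import defaultdict


def adaptive_multipliers(scores, categories, lam=0.5, eps=1e-6,
                         alpha_min=0.5, alpha_max=2.0):
    """Return one multiplier per criterion for a single prompt.

    scores[i][j] is the score of rollout i on criterion j.
    categories[j] is the category label of criterion j.
    All default values are illustrative assumptions.
    """
    n_rollouts = len(scores)
    n_criteria = len(categories)

    # g_j = sqrt(v_j + eps); v_j is ASSUMED to be the variance of
    # criterion j across the rollouts sampled for this prompt.
    g = []
    for j in range(n_criteria):
        column = [scores[i][j] for i in range(n_rollouts)]
        mean = sum(column) / n_rollouts
        var = sum((x - mean) ** 2 for x in column) / n_rollouts
        g.append(math.sqrt(var + eps))

    # g-bar is ASSUMED to be the mean of g within each category.
    by_category = defaultdict(list)
    for j, cat in enumerate(categories):
        by_category[cat].append(g[j])
    g_bar = {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

    # alpha_j = clip((1 - lam) + lam * rho_j, alpha_min, alpha_max)
    alphas = []
    for j, cat in enumerate(categories):
        rho = g[j] / g_bar[cat]
        alpha = (1.0 - lam) + lam * rho
        alphas.append(min(max(alpha, alpha_min), alpha_max))
    return alphas


# Criterion 0 is saturated (every rollout passes); criterion 1 separates rollouts.
demo_scores = [[1, 0], [1, 1], [1, 0], [1, 1]]
demo_alphas = adaptive_multipliers(demo_scores, ["content", "content"])
print(demo_alphas)
```

Under these assumptions the saturated criterion is down-weighted and the contrastive one is up-weighted, while setting `lam=0` recovers a multiplier of 1 for every criterion, that is, the static human weights.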
If this is right
- Higher mean rubric reward than vanilla GRPO with static aggregation.
- Larger fraction of prompts whose responses satisfy every required criterion.
- Reaching the same performance plateau in 2.5 to 4 times fewer training steps.
- Consistent wins across three base policies and two datasets that include both multimodal and text-only cases.
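The two headline metrics can be sketched as follows. Strict completion follows the abstract's definition (the fraction of prompts whose response satisfies every required rubric criterion). The per-prompt rubric reward is written here as a normalized weighted pass rate with positive weights, which is an assumption about the aggregation; the dictionary keys and function names are hypothetical.

```python
def rubric_reward(criteria):
    """Weighted fraction of satisfied criteria for one response.

    ASSUMPTION: static aggregation is a normalized weighted sum with
    positive human-assigned weights.
    """
    total = sum(c["weight"] for c in criteria)
    earned = sum(c["weight"] for c in criteria if c["passed"])
    return earned / total


def strict_completion(prompts):
    """Fraction of prompts whose response passes every required criterion."""
    completed = sum(
        all(c["passed"] for c in criteria if c["required"])
        for criteria in prompts
    )
    return completed / len(prompts)


# Two illustrative prompts with made-up rubric outcomes.
prompts = [
    [{"weight": 2.0, "required": True, "passed": True},
     {"weight": 1.0, "required": False, "passed": False}],
    [{"weight": 1.0, "required": True, "passed": False},
     {"weight": 1.0, "required": True, "passed": True}],
]
mean_reward = sum(rubric_reward(p) for p in prompts) / len(prompts)
print(mean_reward, strict_completion(prompts))
```

The example shows why the two metrics can diverge: the first prompt counts as strictly complete despite failing an optional criterion, while the second earns half the reward but fails strict completion.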
Where Pith is reading between the lines
- The same contrast-based reweighting could be tested in other multi-criteria reward designs where static weights underperform.
- Separating the signal used for training from the rubric used for final scoring may apply to preference-based RL as well.
- Checking whether the adaptation reduces reward hacking on high-weight but uninformative criteria would be a direct follow-up experiment.
Load-bearing premise
That rollout-level contrast between policy outputs can be used to reweight criteria without distorting the human-specified rubric objective or introducing optimization instability.
What would settle it
An experiment in which POW3R produces lower strict completion rates or requires more steps than fixed-weight rubric rewards on the same base policies and datasets would show the adaptation provides no benefit.
Original abstract
Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
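A small sketch of why a saturated criterion contributes no learning signal under GRPO, assuming the usual group-normalized advantage (reward minus the group mean, divided by the group standard deviation). The two-criterion rubric, its weights of 0.8 and 0.2, and the function name are illustrative assumptions, not values from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages for the rollouts of one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Criterion A (weight 0.8) is passed by every rollout: saturated.
# Criterion B (weight 0.2) is passed by rollouts 0 and 2 only.
passes_a = [1, 1, 1, 1]
passes_b = [1, 0, 1, 0]

static_rewards = [0.8 * a + 0.2 * b for a, b in zip(passes_a, passes_b)]
b_only_rewards = [0.2 * b for b in passes_b]

# The saturated criterion shifts every reward by the same constant,
# so it cancels when the group mean is subtracted.
print(group_advantages(static_rewards))
print(group_advantages(b_only_rewards))
```

In this toy group the high-weight criterion A changes nothing about the advantages, which is the sense in which human-assigned importance and current usefulness as an optimization signal can come apart.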
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces POW3R, a policy-aware rubric reward framework for RLVR that preserves human-assigned weights and category balance as the fixed objective while dynamically adapting per-criterion weights via rollout-level contrast to emphasize currently distinguishing criteria. It argues that static aggregations conflate importance with optimization utility and shows that this leads to suboptimal signals when criteria are saturated or unreachable. Across three base policies on two datasets (multimodal and text-only), POW3R outperforms vanilla GRPO with rubric rewards in 24 of 30 comparisons, improving mean rubric reward and strict completion rate while reaching the same performance plateau in 2.5 to 4 times fewer training steps.
Significance. If the reported gains hold under rigorous statistical scrutiny, the work provides a practical, low-overhead enhancement to rubric-based RL by making rewards more informative for the current policy state without altering the human-specified target. The consistent improvements across diverse settings and the faster convergence are potentially valuable for efficient post-training on multi-criteria tasks.
major comments (1)
- [Abstract and experimental results section] The central claim of winning 24 of 30 base-policy/metric comparisons, plus 2.5 to 4 times faster plateauing, is reported without any details on the number of random seeds, variance or standard errors across runs, or statistical significance tests. This absence directly affects the verifiability of the empirical superiority and is load-bearing for the paper's main contribution.
minor comments (2)
- [Method section] The precise formula or algorithm for computing rollout-level contrast and reweighting the criteria should be presented with an equation or pseudocode to allow exact reproduction.
- [Introduction or evaluation metrics] Ensure the definition of 'strict completion' (fraction of prompts satisfying every rubric criterion) is stated explicitly in the main text, not only in the abstract.
Simulated Author's Rebuttal
We thank the referee for their thorough review and positive assessment of the manuscript. We appreciate the constructive feedback on strengthening the empirical presentation and will address the concern directly in the revision.
- Referee: [Abstract and experimental results section] The central claim of winning 24 of 30 base-policy/metric comparisons, plus 2.5 to 4 times faster plateauing, is reported without any details on the number of random seeds, variance or standard errors across runs, or statistical significance tests. This absence directly affects the verifiability of the empirical superiority and is load-bearing for the paper's main contribution.
  Authors: We agree that the current manuscript does not report the number of random seeds, variance or standard errors across runs, or statistical significance tests, and that this information is needed to verify the reported improvements. In the revised manuscript we will expand the experimental results section to report the number of random seeds used for each base-policy and dataset combination, present mean rubric rewards and strict completion rates with standard errors across runs, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) to support the 24-of-30 comparison wins and the 2.5 to 4 times faster plateauing claims. These additions will be made without changing the underlying experimental protocol or results.
  Revision: yes
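As a rough illustration of the kind of check the referee asks for, the sketch below runs an exact one-sided sign test on the 24-of-30 win count. This swaps in a sign test because only win counts appear on this page; the paired t-tests or Wilcoxon tests named in the rebuttal would need per-seed scores that are not available here. It also assumes the 30 comparisons are independent and tie-free, which is unlikely to hold exactly since several metrics share the same base policy, so the resulting value is indicative at best. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, n):
    """Exact one-sided binomial tail P(X >= wins) for X ~ Binomial(n, 0.5).

    ASSUMPTION: each comparison is an independent, tie-free coin flip
    under the null hypothesis that neither method is better.
    """
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Win count taken from the abstract: 24 of 30 comparisons.
p = sign_test_p_value(24, 30)
print(f"one-sided sign-test p-value: {p:.6f}")
```

A test of this form says nothing about effect size or seed-to-seed variance, which is why the standard errors promised in the rebuttal would still be needed.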
Circularity Check
No significant circularity in derivation chain
The paper introduces POW3R as an empirical policy-aware adaptation layered on top of standard GRPO with rubric rewards. It uses rollout-level contrast to modulate per-criterion emphasis while explicitly preserving the original human-assigned weights and category balance as the evaluation target. No equations or claims reduce a prediction or result to a fitted quantity by construction, and no load-bearing steps rely on self-citations, uniqueness theorems, or ansatzes imported from prior author work. The reported gains in mean rubric reward, strict completion rate, and faster convergence are presented as direct empirical outcomes across three base policies and two datasets, keeping the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multiple independent rollouts can be sampled from the current policy for each prompt to compute contrast.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem:
  POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs... $g_j^{(t)} = \sqrt{v_j^{(t)} + \epsilon}$, $\rho_j^{(t)} = g_j^{(t)} / \bar{g}_{\kappa_j}^{(t)}$, $\hat{\alpha}_j^{(t)} = \mathrm{clip}\big((1-\lambda) + \lambda \rho_j^{(t)},\ \alpha_{\min},\ \alpha_{\max}\big)$
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem:
  preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.