Step-wise Rubric Rewards for LLM Reasoning
Pith reviewed 2026-05-20 13:26 UTC · model grok-4.3
The pith
Step-wise Rubric Rewards assign supervision to individual reasoning steps to fix aggregation problems in LLM reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SRaR uses an LLM judge to attribute each rubric item to a specific reasoning step, normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable.
What carries the argument
The LLM-judge attribution of rubric items to individual reasoning steps, combined with per-step score normalization across rollouts and a decoupled advantage estimator for merging with outcome rewards.
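The normalization idea can be illustrated with a small sketch. It assumes each rollout's rubric scores are already keyed by a shared step identifier (the hypothetical `step_id`), and it uses a z-score within each group of rollouts. The paper's exact formula is not given on this page, so the scaling by standard deviation is an assumption; only the zero-signal behaviour for non-varying steps is taken from the text.

```python
from statistics import mean, pstdev

def normalize_step_scores(scores_by_rollout, eps=1e-6):
    """Center and scale per-step rubric scores across rollouts.

    scores_by_rollout: list of dicts {step_id: raw rubric score}, one per rollout.
    The z-score form is an assumption, not the paper's confirmed formula.
    """
    # Collect, for each step id, the scores of all rollouts that contain it.
    groups = {}
    for scores in scores_by_rollout:
        for step_id, score in scores.items():
            groups.setdefault(step_id, []).append(score)
    stats = {k: (mean(v), pstdev(v)) for k, v in groups.items()}

    normalized = []
    for scores in scores_by_rollout:
        row = {}
        for step_id, score in scores.items():
            mu, sd = stats[step_id]
            # A step whose quality does not vary yields no learning signal.
            row[step_id] = 0.0 if sd < eps else (score - mu) / sd
        normalized.append(row)
    return normalized

rollouts = [{"s1": 1.0, "s2": 0.5}, {"s1": 1.0, "s2": 1.0}, {"s1": 1.0, "s2": 0.0}]
normed = normalize_step_scores(rollouts)
```

Here `s1` is scored identically in every rollout and therefore maps to zero everywhere, while `s2` receives positive and negative values that sum to zero across the group.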
If this is right
- Improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B across six mathematical reasoning benchmarks.
- Raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%.
- Reduces self-correction looping from 48.1% to 26.5%.
- Avoids the observed pattern where 18.2% of steps in correct-answer responses are wrong yet positively rewarded and 49.9% of steps in incorrect-answer responses are correct yet penalized.
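The two misalignment rates in the last bullet can be computed from step-level correctness labels. The sketch below assumes a hypothetical input format of `(answer_correct, step_labels)` pairs and uses toy data; it is not the paper's evaluation code and does not reproduce the 18.2% or 49.9% figures.

```python
def misaligned_step_rates(responses):
    """Return (wrong-step rate in correct-answer responses,
               correct-step rate in incorrect-answer responses).

    responses: list of (answer_correct: bool, step_labels: list of bool),
    a hypothetical format chosen for illustration.
    """
    wrong_in_correct = total_in_correct = 0
    right_in_incorrect = total_in_incorrect = 0
    for answer_correct, steps in responses:
        if answer_correct:
            # Response-level reward is positive, so wrong steps get rewarded.
            total_in_correct += len(steps)
            wrong_in_correct += sum(1 for ok in steps if not ok)
        else:
            # Response-level reward is negative, so correct steps get penalized.
            total_in_incorrect += len(steps)
            right_in_incorrect += sum(1 for ok in steps if ok)
    rewarded_wrong = wrong_in_correct / total_in_correct if total_in_correct else 0.0
    penalized_right = right_in_incorrect / total_in_incorrect if total_in_incorrect else 0.0
    return rewarded_wrong, penalized_right

toy = [(True, [True, True, False, True]), (False, [True, False])]
rates = misaligned_step_rates(toy)
```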
Where Pith is reading between the lines
- The contrastive distillation process used to build the 16K-problem rubric dataset could be reused to create targeted training signals for other multi-step tasks such as code generation or scientific proof construction.
- Decoupling step-wise and outcome advantages may stabilize training in reinforcement learning settings that combine several different reward components beyond reasoning.
- Applying the same attribution technique outside mathematical domains could test whether step-level supervision reduces reward hacking in areas like long-form writing or multi-turn dialogue.
- If future judge models become more accurate at step attribution, the framework would automatically deliver stronger per-step signals without requiring changes to the training pipeline.
Load-bearing premise
The LLM judge can reliably map each rubric item to the correct reasoning step without frequent or systematic attribution errors.
What would settle it
Direct measurement of the LLM judge's attribution accuracy on a sample of rollouts showing high error rates that would make the per-step reward signal too noisy to produce the claimed gains over standard rubric aggregation.
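One way such a measurement might be scored, assuming a sample of judge attributions has been labeled by human annotators: judge accuracy against a reference label, plus Cohen's kappa as a chance-corrected agreement measure. The label encoding and function names here are illustrative assumptions.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Fraction of items where the judge's label equals the reference label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(count_a[l] * count_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)

# Toy labels: 1 = attribution judged correct, 0 = judged incorrect.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A low judge accuracy together with high inter-annotator kappa would point to the judge, rather than task ambiguity, as the source of noise.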
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.
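Component (iii) can be sketched as follows. The sketch assumes a group-normalized outcome advantage computed from outcome rewards alone, which is then broadcast to every token and added to a weighted, already-normalized per-step rubric reward. The function names, the `weight` value, and the additive form are assumptions for illustration; the paper's exact estimator is not reproduced on this page.

```python
from statistics import mean, pstdev

def outcome_advantages(outcome_rewards, eps=1e-6):
    """Group-normalized outcome advantage, one scalar per rollout.

    The baseline uses outcome rewards only, so rubric scores cannot shift it.
    """
    mu, sd = mean(outcome_rewards), pstdev(outcome_rewards)
    return [(r - mu) / (sd + eps) for r in outcome_rewards]

def token_advantages(outcome_rewards, step_rewards, token_steps, weight=0.5):
    """Combine both signals per token.

    step_rewards[i]: dict {step_id: normalized rubric reward} for rollout i.
    token_steps[i]:  step id of each token in rollout i.
    weight:          hypothetical per-step reward weight.
    """
    base = outcome_advantages(outcome_rewards)
    return [
        [base[i] + weight * step_rewards[i].get(step, 0.0) for step in token_steps[i]]
        for i in range(len(outcome_rewards))
    ]

base = outcome_advantages([1.0, 0.0])
adv = token_advantages(
    [1.0, 0.0],
    [{"a": 1.0}, {"a": -1.0}],
    [["a", "a", "b"], ["a"]],
)
```

Tokens in a step with no attributed rubric item (step `"b"` above) receive only the outcome advantage.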
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Step-wise Rubrics as Rewards (SRaR), extending Rubrics as Rewards (RaR) within RLVR for LLM reasoning. SRaR uses an LLM judge to attribute each rubric item to a specific reasoning step in a rollout, normalizes per-step scores across rollouts so that only varying-quality steps generate a signal, and combines the resulting per-step reward with the outcome reward through a decoupled advantage estimator. The authors construct a 16K-problem rubric dataset via contrastive distillation from correct and flawed paths and report accuracy gains over RaR of 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B across six math benchmarks, plus an increase in Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7% and a drop in self-correction looping from 48.1% to 26.5%.
Significance. If the LLM-judge attribution step is shown to be reliable, SRaR would constitute a clear technical advance by restoring multi-criterion structure and step-level supervision while mitigating uniform reward of incorrect steps and self-correction hacking. The contrastive construction of the 16K rubric dataset and the explicit diagnostic statistics on rewarded wrong steps are concrete strengths that future work can build upon.
major comments (2)
- [Abstract and §3 (Method)] Abstract and method description of the attribution procedure: the central diagnostic (18.2% of steps in correct-answer trajectories are wrong yet positively rewarded) and all per-step learning signals rest on the LLM judge correctly mapping rubric items to individual steps. No human-agreement rates, error analysis, or validation of this attribution step are reported. If attribution errors are systematic, the variance-based normalization produces noisy or biased advantages, so the claimed 3.57-point gain over RaR cannot be confidently attributed to step-wise supervision rather than incidental changes in training dynamics.
- [§5 (Experiments)] Experimental results (benchmark tables and AIME 2025 numbers): concrete accuracy lifts and the reduction in self-correction looping are presented without statistical significance tests, standard deviations across seeds, or ablations that isolate the three new components (attribution, normalization, decoupled estimator). This makes it difficult to assess whether the reported improvements are robust or load-bearing for the central claim.
minor comments (2)
- [§3.2] The normalization formula and the precise form of the decoupled advantage estimator would benefit from an explicit equation or pseudocode block for reproducibility.
- [§3.1] Clarify whether the LLM judge is drawn from the same model family as the policy or trained on overlapping data; this bears on the circularity concern noted in the review.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to specific revisions that directly strengthen the manuscript's claims.
-
Referee: [Abstract and §3 (Method)] Abstract and method description of the attribution procedure: the central diagnostic (18.2% of steps in correct-answer trajectories are wrong yet positively rewarded) and all per-step learning signals rest on the LLM judge correctly mapping rubric items to individual steps. No human-agreement rates, error analysis, or validation of this attribution step are reported. If attribution errors are systematic, the variance-based normalization produces noisy or biased advantages, so the claimed 3.57-point gain over RaR cannot be confidently attributed to step-wise supervision rather than incidental changes in training dynamics.
Authors: We agree that the absence of explicit validation for the LLM-judge attribution step is a limitation that weakens the ability to fully attribute gains to step-wise supervision. The manuscript does report diagnostic statistics (18.2% wrong steps positively rewarded in correct trajectories and 49.9% correct steps penalized in incorrect ones) that were obtained via the same judge, providing indirect evidence that attribution errors exist and that the method still yields net benefits. However, without human agreement rates or error analysis, systematic biases cannot be ruled out. In the revised version we will add a dedicated validation subsection: we will sample 200 attributions, have two human annotators label them for correctness, report inter-annotator agreement and judge accuracy, and analyze the most frequent error categories. These results will be used to qualify the strength of the central claim. revision: yes
-
Referee: [§5 (Experiments)] Experimental results (benchmark tables and AIME 2025 numbers): concrete accuracy lifts and the reduction in self-correction looping are presented without statistical significance tests, standard deviations across seeds, or ablations that isolate the three new components (attribution, normalization, decoupled estimator). This makes it difficult to assess whether the reported improvements are robust or load-bearing for the central claim.
Authors: We concur that the experimental presentation would be more convincing with statistical rigor and component-wise ablations. The current results are reported from single runs without error bars or significance tests, and the three novel elements (attribution, per-step normalization, and decoupled advantage) are not isolated. In the revision we will (i) rerun the main experiments on Qwen3-8B and Qwen3-32B with at least three random seeds, reporting mean accuracy, standard deviation, and paired t-test p-values against RaR; (ii) add an ablation table that cumulatively enables attribution, normalization, and the decoupled estimator while keeping all other factors fixed; and (iii) include the same statistical treatment for the AIME 2025 Faithful Reasoning Rate and self-correction looping metrics. These additions will allow readers to assess which components drive the observed gains. revision: yes
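The seed-level statistics promised in (i) could be computed along these lines. The numbers below are synthetic placeholders, not results from the paper, and the helper name is hypothetical. Converting the statistic to a p-value needs the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_rel` is one existing option.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(n)), n - 1

# Synthetic placeholder scores for three seeds (not from the paper).
method_scores = [3.0, 5.0, 4.0]
baseline_scores = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
summary = (mean(method_scores), stdev(method_scores))
```

With only three seeds the test has two degrees of freedom, so even sizeable mean differences may not reach conventional significance thresholds.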
Circularity Check
No significant circularity detected
full rationale
The paper introduces SRaR as an RLVR extension that attributes rubric items via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards using a decoupled advantage estimator. All reported gains (accuracy lifts, Faithful Reasoning Rate, reduced looping) are framed as empirical outcomes of training and evaluation on external benchmarks. No equation reduces the final performance metric to the judge outputs or the normalization by algebraic identity, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The 18.2% statistic is an observational motivation rather than a definitional input, and the method remains externally falsifiable on held-out math problems.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-step reward weight
axioms (1)
- domain assumption: An LLM judge can map rubric criteria to specific tokens or sentences in a reasoning rollout with high fidelity.
invented entities (1)
- step-wise rubric attribution via LLM judge (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes the per-step rubric scores across rollouts so that only steps whose quality varies produce a learning signal, and (iii) combines the resulting per-step reward with the standard outcome reward through a decoupled advantage estimator
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, and Xueqi Cheng. Reward and guidance through rubrics: Promoting exploration to improve multi-domain reasoning. arXiv preprint arXiv:2511.12344, 2025
-
[2]
Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language. arXiv preprint arXiv:2601.06521, 2026
-
[3]
DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645:633–638, 2025
-
[4]
Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling. arXiv preprint arXiv:2512.05665, 2025
-
[5]
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025
-
[6]
Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13806–13834, 2024
-
[7]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
-
[8]
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems, 34:7294–7305, 2021
-
[9]
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
-
[10]
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022
-
[11]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024
-
[12]
Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024
-
[13]
OpenAI. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/, 2024
-
[14]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
-
[15]
Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, Fanchao Qi, Minjia Zhang, and Maosong Sun. From context to skills: Can language models learn from context skillfully?, 2026. URL https://arxiv.org/abs/2604.27660
-
[16]
Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026
Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan ...
-
[17]
Jingyi Wang, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, Chaofan Tao, Haoli Bai, Lu Hou, Lifeng Shang, and Xiao-Ping Zhang. GRPO-VPS: Enhancing group relative policy optimization with verifiable process supervision for effective reasoning. arXiv preprint arXiv:2604.20659, 2026
-
[18]
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2024
-
[19]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023
-
[20]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022
-
[21]
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992
-
[22]
Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, and Kun Yuan. Grouter: Decoupling routing from representation for accelerated moe training. arXiv preprint arXiv:2603.06626, 2026
-
[23]
An Yang, Anfeng Yang, Baosong Yang, Beichen Bi, Binyuan Hui, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
-
[24]
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
-
[25]
Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, and Junjie Hu. Mentor: Efficient multimodal-conditioned tuning for autoregressive vision generation models. arXiv preprint arXiv:2507.09574, 2025
-
[26]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2023.
Appendix A Limitations
We identify the following limitations of our work: 1. Dependence on the LLM ...
-
[27]
Are the mathematical operations correct?
-
[28]
Is the logic/deduction valid?
-
[29]
Are there any errors in the step?
Respond with EXACTLY one of:
• “CORRECT” if the step is logically and mathematically correct
• “INCORRECT” if the step contains any error
C Training Hyperparameters
Table 4: Training Hyperparameters.
Hyperparameter: Value
Max Prompt Length: 2048
Max Response Length: 8192
Train Batch Size: 128
Learning Rate: 1×10^−6
Rollout Group Si...
-
[30]
Is the content mathematically correct?
-
[31]
Is it relevant to this specific problem?
-
[32]
Is the type label (SUGGEST/PITFALL/BONUS/ANSWER) appropriate?
Respond with ONLY a JSON object (no markdown, no extra text): {"valid": true or false, "reason": "brief explanation"}
[Figure: training curves plotted against Training Step. (a) Suggest Score (Step); Bonus...]
-
[33]
The outcome component: $\sum_{i=1}^{n} A_{\mathrm{base},i} = 0$ (Proposition O.1).
-
[34]
The rubric component: for each step $k$, $\sum_{i \in G_k} \bar{d}_{k,i} = 0$ (Proposition O.2).
Proof. Follows directly from Propositions O.1 and O.2. The outcome advantage is a scalar broadcast to all tokens, so it sums to zero over rollouts at every token position. The rubric reward at any token $t$ belonging to step $k$ in rollout $i$ equals $\bar{d}_{k,i}$, which sums to zero over $G_k$ by Proposition ...