
arxiv: 2605.17291 · v1 · pith:DQNRXODZnew · submitted 2026-05-17 · 💻 cs.LG

Step-wise Rubric Rewards for LLM Reasoning

Pith reviewed 2026-05-20 13:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · LLM reasoning · rubric rewards · step-wise supervision · mathematical reasoning · faithful reasoning · self-correction

The pith

Step-wise Rubric Rewards assign supervision to individual reasoning steps to fix aggregation problems in LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for language models typically rewards only whether the final answer is correct, leaving intermediate reasoning steps without targeted feedback. Rubric-based methods score entire responses against structured criteria but still collapse everything into one scalar reward, which can reward incorrect steps inside correct answers and penalize correct steps inside wrong answers. This paper introduces Step-wise Rubrics as Rewards (SRaR), which uses an LLM judge to link each rubric item to a particular reasoning step, normalizes the per-step scores across multiple rollouts so only steps whose quality actually varies drive learning, and combines the step rewards with the final outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. The approach is supported by a 16K-problem rubric dataset created through contrastive distillation from correct and flawed reasoning paths. Experiments across six mathematical reasoning benchmarks show higher accuracy, increased faithful reasoning, and reduced self-correction looping compared with prior rubric aggregation methods.

Core claim

SRaR uses an LLM judge to attribute each rubric item to a specific reasoning step, normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable.
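
The normalization in the core claim can be made concrete with a small sketch. This page does not state the exact formula, so the z-score form, the alignment of steps by index, the `eps` threshold, and the name `normalize_step_scores` are all assumptions made for illustration. The point it shows is that a step whose rubric score is identical across rollouts receives no signal.

```python
from statistics import mean, pstdev

def normalize_step_scores(scores, eps=1e-8):
    """Z-score each step's rubric score across the rollouts that contain it.

    scores: list of rollouts, each a list of per-step rubric scores.
    Returns a list of the same shape with normalized scores.
    """
    max_steps = max(len(r) for r in scores)
    out = [[0.0] * len(r) for r in scores]
    for k in range(max_steps):
        # Rollouts long enough to have a step k (hypothetical alignment by index).
        group = [i for i, r in enumerate(scores) if len(r) > k]
        vals = [scores[i][k] for i in group]
        mu, sigma = mean(vals), pstdev(vals)
        if sigma < eps:
            continue  # quality does not vary across rollouts: no learning signal
        for i in group:
            out[i][k] = (scores[i][k] - mu) / sigma
    return out

# Illustrative scores for three rollouts of one problem (not data from the paper).
rollouts = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0]]
normalized = normalize_step_scores(rollouts)
```

In this toy group every rollout scores 1.0 on step 0, so that step is zeroed out, while steps 1 and 2 are centered so their normalized scores sum to zero within the group.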

What carries the argument

The LLM-judge attribution of rubric items to individual reasoning steps, combined with per-step score normalization across rollouts and a decoupled advantage estimator for merging with outcome rewards.
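
A rough sketch of how a decoupled combination might look is given below. It assumes the outcome advantage is group-normalized from outcome rewards alone and then broadcast to every token, with the per-step signal added on top. The mixing weight `lam`, the function names, and the token-to-step map are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def outcome_advantages(outcome_rewards, eps=1e-8):
    """Group-normalize outcome rewards alone, so rubric scores never shift this baseline."""
    mu, sigma = mean(outcome_rewards), pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]

def token_advantages(outcome_rewards, step_signals, step_of_token, lam=0.5):
    """Combine a per-rollout outcome advantage with a per-step signal at token level.

    step_signals[i][k]: already-normalized rubric signal for step k of rollout i.
    step_of_token[i][t]: index of the step that token t of rollout i belongs to.
    lam: hypothetical mixing weight between the two components.
    """
    base = outcome_advantages(outcome_rewards)
    return [
        [base[i] + lam * step_signals[i][k] for k in step_of_token[i]]
        for i in range(len(outcome_rewards))
    ]

# Illustrative inputs: rollout 0 is correct but has a weak step 1,
# rollout 1 is incorrect but has a sound step 1.
adv = token_advantages(
    outcome_rewards=[1.0, 0.0],
    step_signals=[[0.0, -1.0], [0.0, 1.0]],
    step_of_token=[[0, 0, 1], [0, 1, 1]],
)
```

Under these assumptions the weak step in the correct rollout ends up with a smaller advantage than its neighbours (about 0.5 versus 1.0), and the sound step in the incorrect rollout is penalized less (about -0.5 versus -1.0), while the outcome baseline itself is untouched by the rubric scores.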

If this is right

  • Improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B across six mathematical reasoning benchmarks.
  • Raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%.
  • Reduces self-correction looping from 48.1% to 26.5%.
  • Avoids the observed pattern where 18.2% of steps in correct-answer responses are wrong yet positively rewarded and 49.9% of steps in incorrect-answer responses are correct yet penalized.
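
The two mismatch rates in the last bullet can be read as simple step counts. The sketch below assumes a scalar reward that is positive exactly when the final answer is correct, and per-step correctness labels supplied by some judge. The function name and the toy labels are illustrative and do not reproduce the paper's 1,000-problem measurement.

```python
def mismatch_rates(rollouts):
    """rollouts: list of (answer_correct, step_labels) with step_labels a list of bools.

    Returns (share of wrong steps among correct-answer responses,
             share of correct steps among incorrect-answer responses).
    """
    wrong_in_correct = total_in_correct = 0
    right_in_incorrect = total_in_incorrect = 0
    for answer_correct, steps in rollouts:
        if answer_correct:
            total_in_correct += len(steps)
            wrong_in_correct += sum(not s for s in steps)
        else:
            total_in_incorrect += len(steps)
            right_in_incorrect += sum(steps)
    return (
        wrong_in_correct / total_in_correct if total_in_correct else 0.0,
        right_in_incorrect / total_in_incorrect if total_in_incorrect else 0.0,
    )

# Toy labels only: one correct-answer rollout and one incorrect-answer rollout.
toy = [
    (True, [True, False, True, True]),
    (False, [True, True, False, False]),
]
rewarded_wrong, penalized_right = mismatch_rates(toy)
```

On the toy input, one of four steps in the correct-answer rollout is wrong yet would share the positive reward, and two of four steps in the incorrect-answer rollout are correct yet would share the penalty.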

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contrastive distillation process used to build the 16K-problem rubric dataset could be reused to create targeted training signals for other multi-step tasks such as code generation or scientific proof construction.
  • Decoupling step-wise and outcome advantages may stabilize training in reinforcement learning settings that combine several different reward components beyond reasoning.
  • Applying the same attribution technique outside mathematical domains could test whether step-level supervision reduces reward hacking in areas like long-form writing or multi-turn dialogue.
  • If future judge models become more accurate at step attribution, the framework would automatically deliver stronger per-step signals without requiring changes to the training pipeline.

Load-bearing premise

The LLM judge can reliably map each rubric item to the correct reasoning step without frequent or systematic attribution errors.

What would settle it

Direct measurement of the LLM judge's attribution accuracy on a sample of rollouts showing high error rates that would make the per-step reward signal too noisy to produce the claimed gains over standard rubric aggregation.
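
One way such a measurement could be set up is sketched here: compare the judge's step attributions with human labels on a sample, and report both raw accuracy and chance-corrected agreement (Cohen's kappa). The label encoding (the step index each rubric item is attributed to) and the sample values are assumptions for illustration.

```python
from collections import Counter

def accuracy(pred, gold):
    """Fraction of items where the predicted label equals the reference label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohens_kappa(a, b):
    """Cohen's kappa for two label sequences of equal length."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

# Illustrative labels: the step index each of six rubric items was attributed to.
human = [0, 1, 1, 2, 2, 3]
judge = [0, 1, 2, 2, 2, 3]
acc = accuracy(judge, human)
kappa = cohens_kappa(judge, human)
```

Raw accuracy alone can look high when most rubric items map to a few common steps, which is why a chance-corrected statistic is worth reporting alongside it.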

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Step-wise Rubrics as Rewards (SRaR), extending Rubrics as Rewards (RaR) within RLVR for LLM reasoning. SRaR uses an LLM judge to attribute each rubric item to a specific reasoning step in a rollout, normalizes per-step scores across rollouts so that only varying-quality steps generate a signal, and combines the resulting per-step reward with the outcome reward through a decoupled advantage estimator. The authors construct a 16K-problem rubric dataset via contrastive distillation from correct and flawed paths and report accuracy gains over RaR of 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B across six math benchmarks, plus an increase in Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7% and a drop in self-correction looping from 48.1% to 26.5%.

Significance. If the LLM-judge attribution step is shown to be reliable, SRaR would constitute a clear technical advance by restoring multi-criterion structure and step-level supervision while mitigating uniform reward of incorrect steps and self-correction hacking. The contrastive construction of the 16K rubric dataset and the explicit diagnostic statistics on rewarded wrong steps are concrete strengths that future work can build upon.

major comments (2)
  1. [Abstract and §3 (Method)] Abstract and method description of the attribution procedure: the central diagnostic (18.2% of steps in correct-answer trajectories are wrong yet positively rewarded) and all per-step learning signals rest on the LLM judge correctly mapping rubric items to individual steps. No human-agreement rates, error analysis, or validation of this attribution step are reported. If attribution errors are systematic, the variance-based normalization produces noisy or biased advantages, so the claimed 3.57-point gain over RaR cannot be confidently attributed to step-wise supervision rather than incidental changes in training dynamics.
  2. [§5 (Experiments)] Experimental results (benchmark tables and AIME 2025 numbers): concrete accuracy lifts and the reduction in self-correction looping are presented without statistical significance tests, standard deviations across seeds, or ablations that isolate the three new components (attribution, normalization, decoupled estimator). This makes it difficult to assess whether the reported improvements are robust or load-bearing for the central claim.
minor comments (2)
  1. [§3.2] The normalization formula and the precise form of the decoupled advantage estimator would benefit from an explicit equation or pseudocode block for reproducibility.
  2. [§3.1] Clarify whether the LLM judge is drawn from the same model family as the policy or trained on overlapping data; this bears on the circularity concern noted in the review.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to specific revisions that directly strengthen the manuscript's claims.

Point-by-point responses
  1. Referee: [Abstract and §3 (Method)] Abstract and method description of the attribution procedure: the central diagnostic (18.2% of steps in correct-answer trajectories are wrong yet positively rewarded) and all per-step learning signals rest on the LLM judge correctly mapping rubric items to individual steps. No human-agreement rates, error analysis, or validation of this attribution step are reported. If attribution errors are systematic, the variance-based normalization produces noisy or biased advantages, so the claimed 3.57-point gain over RaR cannot be confidently attributed to step-wise supervision rather than incidental changes in training dynamics.

    Authors: We agree that the absence of explicit validation for the LLM-judge attribution step is a limitation that weakens the ability to fully attribute gains to step-wise supervision. The manuscript does report diagnostic statistics (18.2% wrong steps positively rewarded in correct trajectories and 49.9% correct steps penalized in incorrect ones) that were obtained via the same judge, providing indirect evidence that attribution errors exist and that the method still yields net benefits. However, without human agreement rates or error analysis, systematic biases cannot be ruled out. In the revised version we will add a dedicated validation subsection: we will sample 200 attributions, have two human annotators label them for correctness, report inter-annotator agreement and judge accuracy, and analyze the most frequent error categories. These results will be used to qualify the strength of the central claim. revision: yes

  2. Referee: [§5 (Experiments)] Experimental results (benchmark tables and AIME 2025 numbers): concrete accuracy lifts and the reduction in self-correction looping are presented without statistical significance tests, standard deviations across seeds, or ablations that isolate the three new components (attribution, normalization, decoupled estimator). This makes it difficult to assess whether the reported improvements are robust or load-bearing for the central claim.

    Authors: We concur that the experimental presentation would be more convincing with statistical rigor and component-wise ablations. The current results are reported from single runs without error bars or significance tests, and the three novel elements (attribution, per-step normalization, and decoupled advantage) are not isolated. In the revision we will (i) rerun the main experiments on Qwen3-8B and Qwen3-32B with at least three random seeds, reporting mean accuracy, standard deviation, and paired t-test p-values against RaR; (ii) add an ablation table that cumulatively enables attribution, normalization, and the decoupled estimator while keeping all other factors fixed; and (iii) include the same statistical treatment for the AIME 2025 Faithful Reasoning Rate and self-correction looping metrics. These additions will allow readers to assess which components drive the observed gains. revision: yes
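
The seed-level comparison promised in the second response could be computed along these lines. This is a minimal standard-library sketch of the paired t statistic. The per-seed accuracies are placeholders and not results from the paper, and the function name is hypothetical. Converting the statistic to a p-value requires a t distribution, for example from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1

# Placeholder per-seed accuracies, not results from the paper.
srar = [61.0, 63.0, 62.0]
rar = [58.0, 59.0, 60.0]
t_stat, dof = paired_t(srar, rar)
```

With only three seeds the test has two degrees of freedom, so the mean and standard deviation should be reported alongside any p-value.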

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces SRaR as an RLVR extension that attributes rubric items via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards using a decoupled advantage estimator. All reported gains (accuracy lifts, Faithful Reasoning Rate, reduced looping) are framed as empirical outcomes from training and evaluation on external benchmarks. No equations reduce the final performance metric to the judge outputs or normalization by algebraic identity, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The 18.2% statistic is an observational motivation rather than a definitional input, and the method remains externally falsifiable on held-out math problems.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the accuracy of an external LLM judge for step attribution and on the assumption that cross-rollout normalization isolates genuine quality variation; both are introduced without independent verification in the abstract.

free parameters (1)
  • per-step reward weight
    The decoupled advantage estimator must combine the normalized per-step signal with the outcome reward; the mixing coefficient is not stated as fixed by prior literature.
axioms (1)
  • domain assumption: An LLM judge can map rubric criteria to specific tokens or sentences in a reasoning rollout with high fidelity.
    This premise is required for the per-step reward to be meaningful and is invoked when the abstract describes the attribution step.
invented entities (1)
  • step-wise rubric attribution via LLM judge · no independent evidence
    purpose: To convert a single scalar rubric score into per-step learning signals.
    This is a new mechanism introduced in the paper; no independent evidence (such as human validation of attributions) is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5905 in / 1511 out tokens · 44541 ms · 2026-05-20T13:26:14.451113+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes the per-step rubric scores across rollouts so that only steps whose quality varies produce a learning signal, and (iii) combines the resulting per-step reward with the standard outcome reward through a decoupled advantage estimator

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
