GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Soujanya Poria; Tej Deep Pala; Vernon Toh

arxiv: 2606.04889 · v1 · pith:FFSVVB6Dnew · submitted 2026-06-03 · 💻 cs.CL

Pith reviewed 2026-06-28 05:55 UTC · model grok-4.3

keywords reinforcement learning · large language models · mathematical reasoning · advantage reweighting · gradient saliency · verifiable rewards · GRPO

The pith

GRAIL reweights token advantages in RL using gradient saliency to focus updates on answer-sensitive steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that broadcasting a single sequence-level advantage to every token in reinforcement learning with verifiable rewards dilutes the gradient signal for LLMs doing mathematical reasoning. It introduces GRAIL as an intrinsic method that uses gradient-activation saliency to assign higher weights to tokens whose local changes most affect the final answer. This token-wise reweighting is intended to strengthen credit assignment for valid reasoning steps without relying on external process reward models. Experiments across five models from the Qwen3, R1-distilled, and OctoThinker families report average gains of 3.60 percent in accuracy and 3.05 percent in Pass@3 over the GRPO baseline. The central argument is that fine-grained, model-intrinsic reweighting can improve reasoning alignment under sparse verifiable rewards.

Core claim

GRAIL is an intrinsic token-wise advantage reweighting method that applies gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer, thereby improving learning over uniform advantage distribution in GRPO-style reinforcement learning with verifiable rewards.
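
For orientation, the contrast drawn in the core claim can be written out as below. The group-normalized advantage is the standard GRPO form; the weight $w_{i,t}$, the saliency score $s_{i,t}$, and the mean-one normalization are illustrative assumptions, since this page does not reproduce the paper's equations.

```latex
% GRPO: one scalar advantage per sampled response o_i in a group of G responses
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% Uniform broadcasting: every token t of response o_i receives the same value
\hat{A}_{i,t} = \hat{A}_i

% Token-wise reweighting (assumed form): s_{i,t} \ge 0 is a per-token saliency score
\tilde{A}_{i,t} = w_{i,t}\,\hat{A}_i,
\qquad
w_{i,t} = \frac{s_{i,t}}{\frac{1}{T_i}\sum_{u=1}^{T_i} s_{i,u}}
```

Under this assumed normalization the weights average to one over the $T_i$ tokens of a response, so the mean token advantage equals the sequence-level advantage and only its distribution across tokens changes.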

What carries the argument

Gradient-activation saliency used to compute token-wise multipliers for sequence-level advantages.
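
A minimal sketch of how such multipliers might be applied, assuming saliency scores are already available as non-negative numbers per token. The function names, the population standard deviation, and the mean-one normalization are hypothetical choices for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_weights(saliency, eps=1e-8):
    """Turn non-negative saliency scores into multipliers with mean 1 (assumed)."""
    avg = mean(saliency)
    if avg <= eps:  # no usable signal: fall back to uniform weights
        return [1.0] * len(saliency)
    return [s / avg for s in saliency]


def reweighted_advantages(advantage, saliency):
    """Spread one sequence-level advantage over tokens according to saliency."""
    return [w * advantage for w in token_weights(saliency)]


# Toy group of 4 sampled responses with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)

# Hypothetical saliency scores for the 5 tokens of the first response.
saliency = [0.1, 0.1, 0.6, 0.1, 0.1]
token_advs = reweighted_advantages(advs[0], saliency)
```

In this toy case the third token receives six times the advantage of its neighbours, while the average over the five tokens stays equal to the sequence-level value, which keeps the overall update scale comparable to uniform broadcasting.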

If this is right

Token-level reweighting strengthens the gradient signal for valid logical inferences while down-weighting filler or flawed steps.
The approach achieves measurable gains without any process-level supervision or additional reward models.
Gains appear consistently across multiple model families including Qwen3, R1-distilled, and OctoThinker.
Both final-answer accuracy and Pass@3 metrics improve when advantages are reweighted by local sensitivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same saliency mechanism could be tested on other verifiable-reward tasks such as code generation or theorem proving.
Saliency maps produced during training might serve as diagnostic tools for identifying weak reasoning patterns in generated solutions.
If saliency correlates with human-labeled step importance, the method could be extended to hybrid supervision settings.

Load-bearing premise

Gradient-activation saliency accurately flags the tokens most responsible for the final reward, and reweighting advantages by these values improves learning rather than adding harmful bias or instability.

What would settle it

If GRAIL applied to the same models and tasks produces equal or lower accuracy and Pass@3 than standard GRPO, the claim that saliency-based reweighting improves credit assignment would be falsified.

Original abstract

Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRAIL's gradient-saliency reweighting is a reasonable idea for token-level advantages in GRPO, but the abstract supplies no equations, ablations, or statistics, so the reported 3.60% accuracy gain cannot be evaluated.

The main takeaway is that this paper describes a method called GRAIL that reweights advantages in reinforcement learning for LLMs by using gradient-activation saliency to emphasize tokens more sensitive to the final reward. It claims this beats standard GRPO by about 3.6% accuracy on average across several models, all without process reward models.

What is new here is the specific application of saliency from gradients to do token-wise reweighting in the verifiable reward setting. The paper does a decent job laying out the problem with uniform advantage distribution, which makes sense as a motivation.

The approach avoids the cost of training or running a separate PRM, which is a practical plus if it works. The evaluations on Qwen3, R1-distilled, and OctoThinker models show the gains in both accuracy and Pass@3.

That said, the writeup is very light on details. There are no equations showing exactly how the saliency is computed or how the reweighting is applied to the advantages. No ablations isolate whether the saliency component is responsible for the improvement or whether something else, such as hyperparameter choices, is. The concern that saliency may capture attention artifacts or token frequency rather than logical contribution is a real issue here, and without a random-reweighting control or tests of statistical significance, the gains could easily be within run-to-run variance.

The central argument that this provides better credit assignment holds up conceptually, but the evidence presented does not yet support the claims strongly.

This paper is for researchers working on RL fine-tuning of reasoning models who are looking for ways to get step-level signals without extra models. A reader interested in incremental improvements to GRPO might get some value from the idea once more details are available.

It does not deserve a serious referee at this stage because the method and results need much more documentation and validation to be reviewable. I would recommend waiting for a fuller version before engaging further.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting technique for reinforcement learning with verifiable rewards (RLVR) in LLMs. It replaces uniform sequence-level advantage broadcasting (as in GRPO) with weights derived from gradient-activation saliency to emphasize tokens locally sensitive to the final answer. Across five models from the Qwen3, R1-distilled, and OctoThinker families, GRAIL is reported to yield average gains of 3.60% accuracy and 3.05% Pass@3 over GRPO while avoiding process reward models.

Significance. If the central claim holds, GRAIL would offer a practical, parameter-free route to finer credit assignment in RLVR without the cost of PRMs, addressing a known limitation of uniform advantage methods. The multi-family evaluation across Qwen3, R1-distilled, and OctoThinker models is a positive aspect that strengthens generalizability claims.

major comments (3)

[Abstract] The headline claim that GRAIL 'consistently outperforms GRPO' with 3.60% accuracy and 3.05% Pass@3 gains is presented without any derivation of the saliency reweighting, any ablation isolating the gradient-activation component, or any statistical test showing the deltas exceed those expected from random reweighting or hyper-parameter variation.
[Method] The description of gradient-activation saliency as identifying 'tokens that are more locally sensitive to the final answer' provides no explicit equation or argument demonstrating that the resulting weights reflect logical contribution rather than attention artifacts or token frequency; this assumption is load-bearing for the claim that reweighting improves policy gradients over uniform advantage.
[Experiments] The reported average improvements across five models lack per-run variance, confidence intervals, or controls comparing against random token reweighting baselines, so it is impossible to determine whether the observed deltas are attributable to GRAIL or to uncontrolled experimental variance.
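
The controls requested in these comments could look roughly like the sketch below: a paired percentile bootstrap for the mean accuracy difference, and a shuffled-weight control that keeps the same weight values but assigns them to random token positions. All numbers are placeholders invented for illustration and are not results from the paper; the function names are hypothetical.

```python
import random


def paired_bootstrap_ci(baseline, treatment, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (treatment - baseline)."""
    assert len(baseline) == len(treatment)
    rng = random.Random(seed)
    deltas = [t - b for b, t in zip(baseline, treatment)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi


def shuffled_weights(weights, seed=0):
    """Random-reweighting control: same weight values at random token positions."""
    rng = random.Random(seed)
    shuffled = list(weights)
    rng.shuffle(shuffled)
    return shuffled


# Placeholder accuracies (NOT from the paper): one entry per model or per seed.
baseline_acc = [10.0, 20.0, 30.0, 40.0, 50.0]
treatment_acc = [11.0, 22.0, 33.0, 42.0, 52.0]
mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(baseline_acc, treatment_acc)
```

With only five paired points a bootstrap interval is crude, so repeated seeds per model would be needed before reading much into it. The shuffled control preserves the weight distribution, which separates the effect of where the weight lands from the effect of non-uniform weighting as such.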

minor comments (2)

[Abstract] The Pass@3 metric is used without a definition or citation to its standard usage in the mathematical reasoning literature.
The manuscript does not state whether code or training configurations will be released, which would aid reproducibility of the multi-model results.
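
For readers unfamiliar with the metric, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. Whether the paper computes Pass@3 exactly this way is an assumption, since the page gives no definition.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

If exactly three solutions are sampled per problem, `pass_at_k(3, c, 3)` is 1.0 whenever at least one of them is correct and 0.0 otherwise, so in that setting the metric is simply "any of the three attempts succeeded".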

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, proposing revisions where they strengthen the manuscript without altering its core claims.

Referee: [Abstract] The headline claim that GRAIL 'consistently outperforms GRPO' with 3.60% accuracy and 3.05% Pass@3 gains is presented without any derivation of the saliency reweighting, any ablation isolating the gradient-activation component, or any statistical test showing the deltas exceed those expected from random reweighting or hyper-parameter variation.

Authors: Abstracts function as concise summaries and are not the appropriate location for full derivations, ablations, or statistical tests, which are provided in the Method and Experiments sections. The saliency reweighting derivation appears in the Method section, and the multi-family results are detailed in Experiments. We will revise the abstract to briefly note the mechanism and point to those sections for supporting analysis. We will also add standard deviations to the reported averages to address variance. The consistency of gains across five models from three distinct families already provides evidence against purely random or hyperparameter-driven effects.
revision: partial

Referee: [Method] The description of gradient-activation saliency as identifying 'tokens that are more locally sensitive to the final answer' provides no explicit equation or argument demonstrating that the resulting weights reflect logical contribution rather than attention artifacts or token frequency; this assumption is load-bearing for the claim that reweighting improves policy gradients over uniform advantage.

Authors: We agree that an explicit equation would enhance precision. The revised Method section will include the formal definition of the gradient-activation saliency, computed as the gradient of the verifiable reward with respect to token-level activations. This follows established gradient-based attribution techniques from the interpretability literature. We will add a short argument explaining that, unlike token frequency or static attention patterns, the weights are conditioned on the reward signal and thus prioritize tokens whose local changes affect the final outcome. The empirical improvements across models support that the reweighting captures meaningful contributions beyond artifacts.
revision: yes

Referee: [Experiments] The reported average improvements across five models lack per-run variance, confidence intervals, or controls comparing against random token reweighting baselines, so it is impossible to determine whether the observed deltas are attributable to GRAIL or to uncontrolled experimental variance.

Authors: The Experiments section emphasizes average gains across five models from different families to demonstrate robustness. We will revise the section to report per-run standard deviations and confidence intervals. A random token reweighting baseline was not included because the primary comparison is to the established GRPO method; however, we acknowledge its value as a control and will add a discussion of why the observed consistency across families makes uncontrolled variance unlikely. If the underlying run data permits, we will incorporate the random baseline as an additional experiment.
revision: partial
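
To make the kind of quantity under discussion concrete, here is a toy gradient-times-activation computation in plain Python. A pass/fail verifier output is not differentiable with respect to activations, so the sketch substitutes a differentiable scalar proxy (`answer_score`, a hypothetical stand-in for something like the log-probability of the final answer). The pooling, the readout vector, and all values are assumptions for illustration and are not the paper's definition.

```python
import math


def answer_score(acts, readout):
    """Toy differentiable proxy: tanh of a readout over mean-pooled activations."""
    T, d = len(acts), len(readout)
    pooled = [sum(acts[t][j] for t in range(T)) / T for j in range(d)]
    return math.tanh(sum(readout[j] * pooled[j] for j in range(d)))


def grad_wrt_activations(acts, readout):
    """Analytic gradient of answer_score with respect to every activation entry."""
    T = len(acts)
    score = answer_score(acts, readout)
    # d tanh(z)/dz = 1 - tanh(z)^2, and mean pooling contributes a factor 1/T.
    scale = (1.0 - score ** 2) / T
    return [[scale * r for r in readout] for _ in range(T)]


def gradient_activation_saliency(acts, readout):
    """Per-token saliency: |sum_j grad[t][j] * act[t][j]| (gradient x activation)."""
    grads = grad_wrt_activations(acts, readout)
    return [abs(sum(g * a for g, a in zip(grads[t], acts[t])))
            for t in range(len(acts))]


# Three tokens with 2-dimensional hidden activations (made-up values).
acts = [[0.1, 0.0], [1.5, -0.5], [0.0, 0.2]]
readout = [1.0, -1.0]
saliency = gradient_activation_saliency(acts, readout)
```

In this toy setup the second token gets the largest score because its activation aligns most strongly with the direction that moves the proxy. Whether scores of this kind track logical contribution rather than activation magnitude is exactly the question the referee raises.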

Circularity Check

0 steps flagged

No significant circularity; method and gains are empirically reported without reduction to fitted inputs or self-referential definitions.

The paper introduces GRAIL as a gradient-activation saliency reweighting of sequence-level advantages in verifiable-reward RL, contrasting it with uniform broadcasting in GRPO and costly PRMs. The abstract and description present the reweighting rule as an intrinsic computation from gradients, followed by direct empirical comparison across models showing accuracy and Pass@3 gains. No equations or claims reduce the reported deltas to quantities fitted from the same evaluation data, no self-citation chain is invoked as load-bearing justification, and the central claim (saliency-based reweighting improves credit assignment) is not defined in terms of the outcome it is said to produce. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that gradient saliency correctly captures token importance for the final reward; no free parameters or new physical entities are described.

axioms (2)

domain assumption: Uniform advantage distribution across tokens dilutes the gradient signal because tokens contribute unequally to the final reward.
Stated directly in the abstract as the motivation for reweighting.
domain assumption: Gradient-activation saliency measures local sensitivity of each token to the final answer.
This is the mechanism GRAIL uses to assign weights.

invented entities (1)

Gradient-Reweighted Advantage (GRAIL) · no independent evidence
purpose: Token-wise advantage reweighting for RL with verifiable rewards
New method introduced to address uniform advantage dilution.

pith-pipeline@v0.9.1-grok · 5717 in / 1465 out tokens · 35088 ms · 2026-06-28T05:55:52.666899+00:00 · methodology
