DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
Pith reviewed 2026-05-21 05:20 UTC · model grok-4.3
The pith
DelTA improves RLVR for LLM reasoning by estimating per-token coefficients that sharpen side-wise gradient centroids.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DelTA estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction.
What carries the argument
Token coefficient estimation that reweights the RLVR surrogate to enhance contrast between positive- and negative-side centroids of advantage-weighted token-gradient vectors.
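The centroid picture can be written out as a short sketch. It assumes an unnormalized sequence-level surrogate in which every token of response $i$ shares the response advantage $A_i$. The symbols $v_{i,t}$, $T_i$, $W_\pm$, $\mu_\pm$, $g$, and $\eta$ are notation chosen here for illustration and may not match the paper's own.

```latex
% Assumed notation (not taken from the paper):
%   v_{i,t} = \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})   token-gradient vector
%   A_i     = advantage of response i,   T_i = number of tokens in response i
g \;=\; \sum_i A_i \sum_{t=1}^{T_i} v_{i,t} \;=\; W_+\,\mu_+ \;-\; W_-\,\mu_- ,
\qquad
\mu_\pm \;=\; \frac{1}{W_\pm} \sum_{i:\,\pm A_i > 0} |A_i| \sum_{t=1}^{T_i} v_{i,t},
\qquad
W_\pm \;=\; \sum_{i:\,\pm A_i > 0} |A_i|\, T_i .

% First-order effect of a step \theta \leftarrow \theta + \eta g on a token with gradient v_{j,s}:
\Delta \log \pi_\theta(y_{j,s}) \;\approx\; \eta\, v_{j,s}^{\top} g
\;=\; \eta \left( W_+\, v_{j,s}^{\top}\mu_+ \;-\; W_-\, v_{j,s}^{\top}\mu_- \right).
```

Under these assumptions the sign of the linear score $v_{j,s}^{\top} g$ decides whether a token's log-probability rises or falls to first order, which is the sense in which the update direction acts as a linear discriminator.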
If this is right
- On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base.
- On Qwen3-14B-Base, the margin over the strongest same-scale baselines is 2.62 average points.
- The gains extend to code generation tasks and out-of-domain evaluations.
- The gains also hold on a different backbone model.
Where Pith is reading between the lines
- The same coefficient estimation could be tested in other sequence-level RL settings where rewards arrive only at the end of a trajectory.
- Running DelTA on tasks without verifiable correctness signals might show whether the discriminator view depends on having clear positive-negative labels.
- Combining the coefficient estimation with existing variance-reduction techniques could further stabilize the reshaped update directions.
Load-bearing premise
That reweighting token gradients with estimated coefficients will produce more effective policy updates by making centroids distinguish high-reward responses from low-reward ones.
What would settle it
Applying DelTA to the seven mathematical benchmarks and finding average scores no higher than the strongest same-scale baselines would show the reweighting failed to improve the update direction.
Original abstract
Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.
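The centroid construction described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names `side_centroids` and `update_direction`, the three-dimensional toy "token gradients", and the choice of one shared axis plus one axis per side are all assumptions made for illustration.

```python
import numpy as np

def side_centroids(token_grads, advantages):
    """Advantage-weighted centroids of token-gradient vectors, per side.

    token_grads: list of (T_i, d) arrays, one per response (hypothetical inputs).
    advantages:  one scalar advantage per response.
    Returns (mu_pos, mu_neg, w_pos, w_neg), where w_* is the total weight per side.
    """
    d = token_grads[0].shape[1]
    sums = {1: np.zeros(d), -1: np.zeros(d)}
    weights = {1: 0.0, -1: 0.0}
    for v, a in zip(token_grads, advantages):
        if a == 0:
            continue  # zero-advantage responses contribute nothing
        side = 1 if a > 0 else -1
        sums[side] += abs(a) * v.sum(axis=0)
        weights[side] += abs(a) * v.shape[0]
    mu_pos = sums[1] / max(weights[1], 1e-12)
    mu_neg = sums[-1] / max(weights[-1], 1e-12)
    return mu_pos, mu_neg, weights[1], weights[-1]

def update_direction(token_grads, advantages):
    """Sequence-level direction g = w_pos * mu_pos - w_neg * mu_neg."""
    mu_pos, mu_neg, w_pos, w_neg = side_centroids(token_grads, advantages)
    return w_pos * mu_pos - w_neg * mu_neg

# Toy data: axis 0 = shared "formatting" direction, axis 1 = positive-specific,
# axis 2 = negative-specific. All values are made up for illustration.
e = np.eye(3)
pos = np.vstack([np.tile(e[0], (9, 1)), e[1][None, :]])  # 9 shared tokens + 1 specific
neg = np.vstack([np.tile(e[0], (4, 1)), e[2][None, :]])  # 4 shared tokens + 1 specific

mu_pos, mu_neg, w_pos, w_neg = side_centroids([pos, neg], [1.0, -1.0])
g = update_direction([pos, neg], [1.0, -1.0])
scores = e @ g  # linear discriminator score for each basis direction
```

In this toy case both centroids are dominated by the shared axis, and the shared direction receives the largest score in `g`, ahead of the positive-specific direction. That is one way to read the dilution effect the abstract describes, under the stated assumptions.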
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DelTA, a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) in large language models. It frames the policy-gradient update direction as a linear discriminator over token-gradient vectors, where standard advantage-weighted positive- and negative-side centroids can be dominated by shared high-frequency patterns. DelTA estimates per-token coefficients to amplify side-specific discriminative directions and downweight shared or weakly discriminative ones, reweighting a self-normalized RLVR surrogate to produce more contrastive centroids. On seven mathematical benchmarks, it reports average improvements of 3.26 and 2.62 points over the strongest same-scale baselines on Qwen3-8B-Base and Qwen3-14B-Base, respectively, with further results on code generation, a different backbone, and out-of-domain settings.
Significance. If the discriminator view is accurate and the learned coefficients demonstrably improve centroid contrastivity beyond self-normalization, the work would supply both a conceptual lens for token-level credit assignment in RLVR and a practical technique that could enhance reasoning performance in LLMs. The reported gains on math and code tasks, together with claims of generalization, suggest potential utility for post-training pipelines, provided the source of the gains is isolated and the method is shown to be robust.
major comments (3)
- Abstract: the reported 3.26 / 2.62 point lifts are presented without error bars, number of random seeds, or statistical tests, making it impossible to determine whether the gains exceed run-to-run variance or baseline implementation differences.
- Method section on coefficient estimation: the construction reweights a self-normalized surrogate, yet no ablation is described that isolates the contribution of the learned per-token coefficients from the self-normalization step itself; without this, the central claim that the coefficients amplify sparse discriminative directions remains unverified.
- Experiments: no quantitative check (e.g., cosine distance between side-wise centroids or linear classification accuracy on held-out token-gradient vectors) is reported to confirm that the estimated coefficients produce measurably more contrastive centroids than the unweighted self-normalized baseline.
minor comments (1)
- The abstract refers to 'additional results on code generation, a different backbone, and out-of-domain evaluations' without naming the specific benchmarks or reporting the corresponding numerical improvements, which would strengthen the generalization claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of statistical reporting, ablation design, and empirical validation that we agree merit clarification and expansion. We outline our responses below and the revisions we will incorporate to address them.
- Referee: Abstract: the reported 3.26 / 2.62 point lifts are presented without error bars, number of random seeds, or statistical tests, making it impossible to determine whether the gains exceed run-to-run variance or baseline implementation differences.
  Authors: We agree that the absence of error bars and seed information limits interpretability of the reported gains. Our original runs used a single fixed seed per configuration for computational efficiency and reproducibility. In the revised manuscript we will report results averaged over three independent random seeds for the primary math benchmarks, including mean and standard deviation, and will add a brief statistical note comparing the observed improvements to baseline variance.
  Revision: yes
- Referee: Method section on coefficient estimation: the construction reweights a self-normalized surrogate, yet no ablation is described that isolates the contribution of the learned per-token coefficients from the self-normalization step itself; without this, the central claim that the coefficients amplify sparse discriminative directions remains unverified.
  Authors: We appreciate the request for clearer isolation. Self-normalization is an integral part of the surrogate we start from, yet the learned coefficients are the novel component intended to amplify discriminative directions. We will add a dedicated ablation subsection that compares (1) standard RLVR, (2) self-normalized RLVR without learned coefficients, and (3) the full DelTA formulation. This will directly quantify the incremental benefit attributable to the per-token coefficients.
  Revision: yes
- Referee: Experiments: no quantitative check (e.g., cosine distance between side-wise centroids or linear classification accuracy on held-out token-gradient vectors) is reported to confirm that the estimated coefficients produce measurably more contrastive centroids than the unweighted self-normalized baseline.
  Authors: This is a fair request for direct evidence supporting the discriminator interpretation. We will include new quantitative diagnostics in the experiments section: cosine distances between positive- and negative-side centroids under both the self-normalized baseline and DelTA, plus the accuracy of a linear probe trained to separate held-out token-gradient vectors using the reweighted centroids. These metrics will be reported alongside the main results to verify increased contrastivity.
  Revision: yes
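The two diagnostics named in the last exchange, cosine distance between side-wise centroids and a centroid-based linear probe, could be computed along the following lines. This is only a sketch: the helper names, the choice of a nearest-centroid rule as the "linear probe", and every number in the toy data are assumptions, not values or procedures from the paper.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; larger means the two centroids are more contrastive."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def centroid_probe_accuracy(mu_pos, mu_neg, held_out, labels):
    """Accuracy of a nearest-centroid linear probe on held-out vectors.

    A vector is predicted positive when it is closer (in Euclidean distance)
    to mu_pos than to mu_neg, which is the linear rule held_out @ w + b > 0.
    labels: array of +1 / -1 side labels for the held-out vectors.
    """
    w = mu_pos - mu_neg
    b = 0.5 * (mu_neg @ mu_neg - mu_pos @ mu_pos)
    pred = np.where(held_out @ w + b > 0, 1, -1)
    return float(np.mean(pred == labels))

# Made-up centroids: the "baseline" pair shares most of its mass on axis 0,
# the hypothetical "reweighted" pair puts more mass on side-specific axes.
base_pos, base_neg = np.array([0.9, 0.1, 0.0]), np.array([0.9, 0.0, 0.1])
rw_pos, rw_neg = np.array([0.5, 0.5, 0.0]), np.array([0.5, 0.0, 0.5])

d_base = cosine_distance(base_pos, base_neg)
d_rw = cosine_distance(rw_pos, rw_neg)

# Made-up held-out token-gradient vectors with side labels.
held_out = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.2, 1.0, 0.0], [0.3, 0.0, 1.0]])
labels = np.array([1, -1, 1, -1])
acc_base = centroid_probe_accuracy(base_pos, base_neg, held_out, labels)
```

Reporting `d_base` against `d_rw` and the probe accuracy for each set of centroids, on real token-gradient vectors rather than toy ones, would be one way to produce the comparison the referee asks for.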
Circularity Check
No significant circularity; derivation is self-contained empirical method
full rationale
The paper presents an interpretive discriminator view of standard RLVR policy gradients (constructed via advantage-weighted token-gradient centroids) and then introduces DelTA as an empirical modification that estimates per-token coefficients to reweight a self-normalized surrogate. No load-bearing step reduces by definition or construction to its own inputs: the claimed outperformance on mathematical benchmarks is an experimental result, not a first-principles prediction forced by fitted parameters or self-citation chains. The self-normalization and coefficient estimation are explicitly part of the proposed algorithm rather than hidden tautologies, and the paper does not invoke uniqueness theorems or prior self-work to force its choices. The central claim remains independently falsifiable via the reported benchmark comparisons.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors.
invented entities (1)
- Discriminative token coefficients: no independent evidence.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "DelTA estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive"
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: $\alpha^{(k)}_{i,t} = \sigma\!\left( \frac{\lVert v_{i,t} - \mu^{(k)}_{-} \rVert_2^2 - \lVert v_{i,t} - \mu^{(k)}_{+} \rVert_2^2}{\gamma^{(k)}_{+}} \right)$
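The quoted coefficient formula can be evaluated directly. The sketch below treats $\gamma$ as a positive temperature and ignores the iteration index $(k)$, since neither is defined on this page; the function name `token_coefficients` and the toy centroids are likewise illustrative assumptions rather than the paper's code.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, written via tanh for numerical stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def token_coefficients(v, mu_pos, mu_neg, gamma):
    """alpha = sigmoid((||v - mu_neg||^2 - ||v - mu_pos||^2) / gamma).

    v:      (N, d) array of token-gradient vectors (hypothetical inputs).
    mu_pos: (d,) positive-side centroid; mu_neg: (d,) negative-side centroid.
    gamma:  positive scale, assumed here to act as a temperature.
    """
    d_neg = np.sum((v - mu_neg) ** 2, axis=-1)
    d_pos = np.sum((v - mu_pos) ** 2, axis=-1)
    return sigmoid((d_neg - d_pos) / gamma)

# Toy centroids sharing axis 0 and differing on axes 1 and 2 (made-up values).
mu_pos = np.array([1.0, 1.0, 0.0])
mu_neg = np.array([1.0, 0.0, 1.0])
tokens = np.eye(3)  # one shared-direction token, one positive-specific, one negative-specific

alpha = token_coefficients(tokens, mu_pos, mu_neg, gamma=1.0)
```

Expanding the squares shows that the numerator equals $2\,v^{\top}(\mu_+ - \mu_-) + \lVert\mu_-\rVert^2 - \lVert\mu_+\rVert^2$, so the coefficient is a logistic function of a linear score in $v$. In the toy case the shared-direction token sits at exactly 0.5, the token nearer the positive centroid rises above 0.5, and the token nearer the negative centroid falls below it.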
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.