DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang; Wei Wu; Yankai Lin

arxiv: 2605.21467 · v1 · pith:T7YFYCYOnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 05:20 UTC · model grok-4.3

keywords: reinforcement learning from verifiable rewards · token credit assignment · large language models · mathematical reasoning · policy gradient · discriminator view · RLVR

The pith

DelTA improves RLVR for LLM reasoning by estimating per-token coefficients that sharpen side-wise gradient centroids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames standard RLVR policy-gradient updates as a linear discriminator over token-gradient vectors, where positive and negative centroids decide which tokens increase or decrease in probability. These centroids are often dominated by frequent shared patterns such as formatting tokens, which dilutes the sparse signals that actually separate high-reward from low-reward responses. DelTA estimates coefficients for each token that amplify discriminative directions on one side while downweighting common or weak ones, then applies the coefficients to reweight the self-normalized RLVR surrogate. The resulting update direction produces more contrastive centroids and better reasoning performance. A reader would care because the method supplies a concrete way to translate sequence-level rewards into more precise token-level credit assignment.
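To make the discriminator reading concrete, here is a minimal Python sketch on toy data. It is not the paper's implementation: the 2-D vectors, the helper names (`weighted_centroid`, `score`), and the simplification that both sides carry equal total weight (so the update direction reduces to the centroid difference) are all assumptions made for illustration.

```python
# Toy illustration of the discriminator view (assumed setup, not the paper's code).
# Each token-gradient vector is 2-D: axis 0 stands in for a shared "formatting"
# direction, axis 1 for a sparse direction that differs between the two sides.

def weighted_centroid(vectors, weights):
    """Weighted average of vectors, normalized by the total weight."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Tokens from high-reward (positive) and low-reward (negative) responses.
pos_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neg_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, -1.0]]
# Hypothetical advantage magnitudes, one per token (inherited from its response).
pos_weights = [1.0, 1.0, 1.0]
neg_weights = [1.0, 1.0, 1.0]

mu_pos = weighted_centroid(pos_vectors, pos_weights)
mu_neg = weighted_centroid(neg_vectors, neg_weights)

# With equal total weight on both sides (an assumption), the update direction
# is proportional to the centroid difference.
direction = [p - n for p, n in zip(mu_pos, mu_neg)]

def score(v):
    """Positive: a small step along `direction` raises this token's
    log-probability to first order; negative: it lowers it."""
    return dot(v, direction)

print("mu_pos:", mu_pos)
print("mu_neg:", mu_neg)
print("direction:", direction)
```

In this toy case the shared axis is the largest component of each centroid, yet it cancels in the difference, so only the sparse axis decides the sign of `score`. That is one way to picture how shared patterns can dominate the centroids while the separating signal sits in a smaller component.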

Core claim

DelTA estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction.

What carries the argument

Token coefficient estimation that reweights the RLVR surrogate to enhance contrast between positive- and negative-side centroids of advantage-weighted token-gradient vectors.

If this is right

On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base.
On Qwen3-14B-Base, the margin over the strongest same-scale baselines is 2.62 average points.
The gains extend to code generation tasks and out-of-domain evaluations.
The method also transfers to a different backbone model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coefficient estimation could be tested in other sequence-level RL settings where rewards arrive only at the end of a trajectory.
Running DelTA on tasks without verifiable correctness signals might show whether the discriminator view depends on having clear positive-negative labels.
Combining the coefficient estimation with existing variance-reduction techniques could further stabilize the reshaped update directions.

Load-bearing premise

That reweighting token gradients with estimated coefficients will produce more effective policy updates by making centroids distinguish high-reward responses from low-reward ones.

What would settle it

Applying DelTA to the seven mathematical benchmarks and finding average scores no higher than the strongest same-scale baselines would show the reweighting failed to improve the update direction.

Figures

Figures reproduced from arXiv: 2605.21467 by Kaiyi Zhang, Wei Wu, Yankai Lin.

**Figure 2.** Training dynamics of DelTA compared with DAPO.

**Figure 3.** Training reward under different token-selection strategies. (Plot residue: accuracy on AIME25, AIME26, HMMT25, HMMT26, and their average, for DAPO, Random 50%, and Top 50%.)

**Figure 5.** Token clouds of high-weight and low-weight tokens. To understand what DelTA emphasizes, we visualize high- and low-weight token clouds.

Abstract

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DelTA adds learned token coefficients to reweight RLVR updates for sharper credit assignment, with a few-point math benchmark lift, but the self-normalization step may already explain much of the gain.

DelTA frames the standard RLVR policy gradient as a linear discriminator over token-gradient vectors and then learns per-token coefficients to boost sparse discriminative directions while downweighting shared high-frequency ones like formatting tokens. The coefficients reweight a self-normalized surrogate so the effective positive and negative centroids become more contrastive. That is the core new move.

The paper reports average gains of 3.26 points on Qwen3-8B-Base and 2.62 on the 14B version across seven math benchmarks, plus some supporting runs on code generation and out-of-domain checks. If the full text shows the coefficient estimator is trained separately and the centroids really do separate better, the idea is a clean, targeted tweak to an existing pipeline. The empirical numbers are the main evidence offered.

The soft spot is the one flagged in the stress test. Self-normalization by itself already reduces the influence of frequent tokens, so it is not obvious that the extra learned coefficients are the active ingredient rather than the normalization already present in the surrogate. Without ablations that turn the coefficients on and off, or direct measurements of centroid contrast before and after, the reported lifts could be driven by the normalization step alone. The circularity risk is also real if coefficient estimation and final evaluation share too much data.

This is for groups already running RLVR on reasoning models and looking for small levers on token credit. The discriminator view is clear enough that a serious referee could usefully press on the ablations and the exact training procedure for the coefficients. I would send it to review rather than desk-reject, with the expectation that the authors will need to isolate what the new component actually contributes.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces DelTA, a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) in large language models. It frames the policy-gradient update direction as a linear discriminator over token-gradient vectors, where standard advantage-weighted positive- and negative-side centroids can be dominated by shared high-frequency patterns. DelTA estimates per-token coefficients to amplify side-specific discriminative directions and downweight weakly discriminative ones, reweighting a self-normalized RLVR surrogate to produce more contrastive centroids. On seven mathematical benchmarks, it reports average improvements of 3.26 and 2.62 points over strongest same-scale baselines on Qwen3-8B-Base and Qwen3-14B-Base, with further results on code generation, alternative backbones, and out-of-domain settings.

Significance. If the discriminator view is accurate and the learned coefficients demonstrably improve centroid contrastivity beyond self-normalization, the work would supply both a conceptual lens for token-level credit assignment in RLVR and a practical technique that could enhance reasoning performance in LLMs. The reported gains on math and code tasks, together with claims of generalization, suggest potential utility for post-training pipelines, provided the source of the gains is isolated and the method is shown to be robust.

major comments (3)

Abstract: the reported 3.26 / 2.62 point lifts are presented without error bars, number of random seeds, or statistical tests, making it impossible to determine whether the gains exceed run-to-run variance or baseline implementation differences.
Method section on coefficient estimation: the construction reweights a self-normalized surrogate, yet no ablation is described that isolates the contribution of the learned per-token coefficients from the self-normalization step itself; without this, the central claim that the coefficients amplify sparse discriminative directions remains unverified.
Experiments: no quantitative check (e.g., cosine distance between side-wise centroids or linear classification accuracy on held-out token-gradient vectors) is reported to confirm that the estimated coefficients produce measurably more contrastive centroids than the unweighted self-normalized baseline.
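The centroid-contrast check requested in the last major comment could look like the following Python sketch. It is an illustration under stated assumptions, not the authors' evaluation code: the toy vectors, the hand-picked coefficients, and the helper names are hypothetical, and "self-normalized" is taken here to mean dividing by the sum of the weights.

```python
import math

def weighted_centroid(vectors, weights):
    """Self-normalized weighted average (weights are divided by their sum)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy token-gradient vectors: axis 0 is shared by both sides,
# axis 1 separates them.
pos_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neg_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, -1.0]]

uniform = [1.0, 1.0, 1.0]
# Hypothetical coefficients that downweight the two shared tokens.
coeffs = [0.1, 0.1, 1.0]

baseline = cosine_distance(weighted_centroid(pos_vectors, uniform),
                           weighted_centroid(neg_vectors, uniform))
reweighted = cosine_distance(weighted_centroid(pos_vectors, coeffs),
                             weighted_centroid(neg_vectors, coeffs))

print(f"baseline contrast:   {baseline:.3f}")
print(f"reweighted contrast: {reweighted:.3f}")
```

A real check would replace the toy vectors with actual token-gradient vectors (or a low-dimensional projection of them) and the hand-picked coefficients with the ones the method estimates.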

minor comments (1)

The abstract refers to 'additional results on code generation, a different backbone, and out-of-domain evaluations' without naming the specific benchmarks or reporting the corresponding numerical improvements, which would strengthen the generalization claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of statistical reporting, ablation design, and empirical validation that we agree merit clarification and expansion. We outline our responses below and the revisions we will incorporate to address them.

Referee: Abstract: the reported 3.26 / 2.62 point lifts are presented without error bars, number of random seeds, or statistical tests, making it impossible to determine whether the gains exceed run-to-run variance or baseline implementation differences.

Authors: We agree that the absence of error bars and seed information limits interpretability of the reported gains. Our original runs used a single fixed seed per configuration for computational efficiency and reproducibility. In the revised manuscript we will report results averaged over three independent random seeds for the primary math benchmarks, including mean and standard deviation, and will add a brief statistical note comparing the observed improvements to baseline variance.

revision: yes

Referee: Method section on coefficient estimation: the construction reweights a self-normalized surrogate, yet no ablation is described that isolates the contribution of the learned per-token coefficients from the self-normalization step itself; without this, the central claim that the coefficients amplify sparse discriminative directions remains unverified.

Authors: We appreciate the request for clearer isolation. Self-normalization is an integral part of the surrogate we start from, yet the learned coefficients are the novel component intended to amplify discriminative directions. We will add a dedicated ablation subsection that compares (1) standard RLVR, (2) self-normalized RLVR without learned coefficients, and (3) the full DelTA formulation. This will directly quantify the incremental benefit attributable to the per-token coefficients.

revision: yes

Referee: Experiments: no quantitative check (e.g., cosine distance between side-wise centroids or linear classification accuracy on held-out token-gradient vectors) is reported to confirm that the estimated coefficients produce measurably more contrastive centroids than the unweighted self-normalized baseline.

Authors: This is a fair request for direct evidence supporting the discriminator interpretation. We will include new quantitative diagnostics in the experiments section: cosine distances between positive- and negative-side centroids under both the self-normalized baseline and DelTA, plus the accuracy of a linear probe trained to separate held-out token-gradient vectors using the reweighted centroids. These metrics will be reported alongside the main results to verify increased contrastivity.

revision: yes
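The seed-level reporting promised in the first response amounts to a small calculation. The Python sketch below uses placeholder scores that are not results from the paper; the variable names and the three-seed layout are assumptions for illustration.

```python
import statistics

# Placeholder per-seed average scores (NOT results from the paper).
baseline_scores = [50.0, 51.0, 52.0]
method_scores = [53.0, 54.0, 55.0]

def summarize(scores):
    """Return (mean, sample standard deviation) over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

base_mean, base_std = summarize(baseline_scores)
meth_mean, meth_std = summarize(method_scores)
gap = meth_mean - base_mean

print(f"baseline: {base_mean:.2f} +/- {base_std:.2f}")
print(f"method:   {meth_mean:.2f} +/- {meth_std:.2f}")
# A gap is only suggestive when it is large relative to seed-to-seed spread.
print(f"gap: {gap:.2f} (baseline std {base_std:.2f})")
```

With only three seeds the standard deviation is itself a rough estimate, so a gap is better read alongside it than treated as a formal significance test.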

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical method

The paper presents an interpretive discriminator view of standard RLVR policy gradients (constructed via advantage-weighted token-gradient centroids) and then introduces DelTA as an empirical modification that estimates per-token coefficients to reweight a self-normalized surrogate. No load-bearing step reduces by definition or construction to its own inputs: the claimed outperformance on mathematical benchmarks is an experimental result, not a first-principles prediction forced by fitted parameters or self-citation chains. The self-normalization and coefficient estimation are explicitly part of the proposed algorithm rather than hidden tautologies, and the paper does not invoke uniqueness theorems or prior self-work to force its choices. The central claim remains independently falsifiable via the reported benchmark comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger records the high-level assumptions stated there. The work rests on the standard RLVR framework plus one new modeling choice.

axioms (1)

domain assumption: The policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors.
Introduced in the first paragraph of the abstract as the central analytical view.
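
One way to spell out this assumption is the following sketch for a plain policy-gradient surrogate without clipping or importance ratios (a simplification made here, not a statement of the paper's exact objective). The symbols $A_i$ (advantage of response $i$), $T_i$ (its length), $\eta$ (step size), and $M_\pm$ (total weight on each side) are notation introduced for this sketch; $v_{i,t}$ denotes a token-gradient vector and $\mu_\pm$ the side-wise centroids.

$$
v_{i,t} = \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}), \qquad
g = \sum_i A_i \sum_{t=1}^{T_i} v_{i,t} = M_+ \mu_+ - M_- \mu_-,
$$

$$
M_\pm = \sum_{i:\, \pm A_i > 0} |A_i|\, T_i, \qquad
\mu_\pm = \frac{1}{M_\pm} \sum_{i:\, \pm A_i > 0} |A_i| \sum_{t=1}^{T_i} v_{i,t}.
$$

After a step $\theta \leftarrow \theta + \eta g$, the first-order change in the log-probability of a token with gradient vector $v$ is

$$
\Delta \log \pi_\theta \approx \eta \langle v, g \rangle
= \eta \left( M_+ \langle v, \mu_+ \rangle - M_- \langle v, \mu_- \rangle \right),
$$

which is linear in $v$: its sign indicates whether that token's probability goes up or down.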

invented entities (1)

Discriminative token coefficients (no independent evidence)
purpose: Amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones
Core new component of DelTA that reweights the self-normalized RLVR surrogate

pith-pipeline@v0.9.0 · 5795 in / 1551 out tokens · 81687 ms · 2026-05-21T05:20:44.163198+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

DelTA estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive

IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

$\alpha^{(k)}_{i,t} = \sigma\!\left( \frac{\lVert v_{i,t} - \mu^{(k)}_{-} \rVert_2^2 - \lVert v_{i,t} - \mu^{(k)}_{+} \rVert_2^2}{\gamma^{(k)}_{+}} \right)$
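
Read literally, the quoted formula can be sketched in Python as below. This is a hedged illustration: the meaning of the superscript (k) and the way the scale γ is chosen are not given in this excerpt, so the code treats the centroids and `gamma` as plain inputs, and the function names are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def token_coefficient(v, mu_pos, mu_neg, gamma):
    """sigma((||v - mu_neg||^2 - ||v - mu_pos||^2) / gamma), per the quoted formula.

    The margin expands to 2*<v, mu_pos - mu_neg> + ||mu_neg||^2 - ||mu_pos||^2,
    so it is linear in v.
    """
    margin = sq_dist(v, mu_neg) - sq_dist(v, mu_pos)
    return sigmoid(margin / gamma)

# Toy centroids (hypothetical values, not taken from the paper).
mu_pos = [0.0, 1.0]
mu_neg = [0.0, -1.0]

near_pos = token_coefficient([0.0, 1.0], mu_pos, mu_neg, gamma=1.0)
shared = token_coefficient([1.0, 0.0], mu_pos, mu_neg, gamma=1.0)
near_neg = token_coefficient([0.0, -1.0], mu_pos, mu_neg, gamma=1.0)

print(near_pos, shared, near_neg)
```

A token equidistant from both centroids gets 0.5, one close to the positive centroid approaches 1, and one close to the negative centroid approaches 0. How the coefficient is used on the negative side is not shown in this excerpt.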

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 21 internal anchors

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[3]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[4]

Advances in Neural Information Processing Systems , volume=

Estimating training data influence by tracing gradient descent , author=. Advances in Neural Information Processing Systems , volume=

work page
[5]

Advances in Neural Information Processing Systems , volume=

First is better than last for language data influence , author=. Advances in Neural Information Processing Systems , volume=

work page
[8]

Advances in neural information processing systems , volume=

Supervised contrastive learning , author=. Advances in neural information processing systems , volume=

work page
[9]

Nature Reviews Methods Primers , volume=

Linear discriminant analysis , author=. Nature Reviews Methods Primers , volume=. 2024 , publisher=

work page 2024
[10]

2013 , publisher=

Applied multiple regression/correlation analysis for the behavioral sciences , author=. 2013 , publisher=

work page 2013
[12]

Advances in neural information processing systems , volume=

Sglang: Efficient execution of structured language model programs , author=. Advances in neural information processing systems , volume=

work page
[13]

American Invitational Mathematics Examination (AIME) 2025 , author=

work page 2025
[14]

American Invitational Mathematics Examination (AIME) 2024 , author=

work page 2024
[15]

American Invitational Mathematics Examination (AIME) 2026 , author=

work page 2026
[16]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions , author =

work page
[17]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

work page 2024
[18]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025
[23]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[24]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Advances in neural information processing systems , volume=

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation , author=. Advances in neural information processing systems , volume=

work page
[31]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Math-shepherd: Verify and reinforce llms step-by-step without human annotations , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[32]

Advances in Neural Information Processing Systems , volume=

Coderl: Mastering code generation through pretrained models and deep reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[36]

Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment , author=

work page
[43]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

The lessons of developing process reward models in mathematical reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[44]

arXiv preprint arXiv:2412.01981 , year=

Free Process Rewards without Process Labels , author=. arXiv preprint arXiv:2412.01981 , year=

work page arXiv
[48]

2025 , eprint=

VinePPO: Refining Credit Assignment in RL Training of LLMs , author=. 2025 , eprint=

work page 2025
[49]

Matharena: Evaluating llms on uncontaminated math competitions, February 2025

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

work page 2025
[50]

Applied multiple regression/correlation analysis for the behavioral sciences

Jacob Cohen, Patricia Cohen, Stephen G West, and Leona S Aiken. Applied multiple regression/correlation analysis for the behavioral sciences. Routledge, 2013

work page 2013
[51]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Vineppo: Refining credit assignment in rl training of llms, 2025

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms, 2025. URL https://arxiv.org/abs/2410.01679

work page arXiv 2025
[59]

Supervised contrastive learning

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33: 0 18661--18673, 2020

work page 2020
[60]

Coderl: Mastering code generation through pretrained models and deep reinforcement learning

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35: 0 21314--21328, 2022

work page 2022
[61]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36: 0 21558--21572, 2023

work page 2023
[62]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Fipo: Eliciting deep reasoning with future-kl influenced policy optimization.arXiv preprint arXiv:2603.19835, 2026

Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization. arXiv preprint arXiv:2603.19835, 2026

work page arXiv 2026
[64]

Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms.arXiv preprint arXiv:2603.22446, 2026

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms. arXiv preprint arXiv:2603.22446, 2026

work page arXiv 2026
[65]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Estimating training data influence by tracing gradient descent

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 33: 0 19920--19930, 2020

work page 2020
[67]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[68]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

Execution-based code generation using deep reinforcement learning

Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816, 2023

work page arXiv 2023
[71]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[73]

Capo: Towards enhancing llm reasoning through generative credit assignment

Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, and Xiao Zhang. Capo: Towards enhancing llm reasoning through generative credit assignment. arXiv preprint arXiv:2508.02298, 2025

work page arXiv 2025
[74]

Learning to Reason under Off-Policy Guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[75]

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2. 5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[76]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[77]

First is better than last for language data influence

Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, and Pradeep Ravikumar. First is better than last for language data influence. Advances in Neural Information Processing Systems, 35: 0 32285--32298, 2022

work page 2022
[78]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[79]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[80]

Stephint: Multi-level stepwise hints enhance reinforcement learning to reason

Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, and Rui Yan. Stephint: Multi-level stepwise hints enhance reinforcement learning to reason. arXiv preprint arXiv:2507.02841, 2025 a

[81]

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

[82]

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

[83]

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2026, 2026

[84]

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 10495--10516, 2025b

[85]

Shuping Zhao, Bob Zhang, Jian Yang, Jianhang Zhou, and Yong Xu. Linear discriminant analysis. Nature Reviews Methods Primers, 4(1):70, 2024

[86]

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

