Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

Jiaqi Wang; Nan Duan; Shuai Dong; Tong Yang; Weichu Xie; Wenpu Liu; Wenqi Shao; Xiaoying Zhang; Yongfu Zhu; Yuqi Xu

arxiv: 2605.17333 · v1 · pith:FTTOUIFInew · submitted 2026-05-17 · 💻 cs.LG

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

Wenpu Liu , Yuqi Xu , Weichu Xie , Yongfu Zhu , Shuai Dong , Ziyue Wang , Wenqi Shao , Xiaoying Zhang

show 3 more authors

Tong Yang Nan Duan Jiaqi Wang

This is my paper

Pith reviewed 2026-05-20 14:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningRLVRerror diversityadvantage shapinggroup rolloutsmathematical reasoningverifiable rewards

0 comments

The pith

Error diversity within group rollouts predicts RLVR success and can be leveraged to improve performance via advantage modulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RLVR samples multiple responses per prompt and assigns binary rewards based on individual correctness, but typically discards the distribution of errors across the group. Empirical analysis shows that prompts eliciting diverse wrong answers lead to substantially larger training gains than those producing repetitive homogeneous failures. The paper introduces Error Diversity Advantage Shaping (EDAS) as a lightweight post-hoc adjustment that modulates the advantage signal for incorrect rollouts, applying stronger penalties to dominant repeated errors and milder penalties to rare exploratory ones. This adjustment integrates into any existing RLVR algorithm and produces consistent gains on math benchmarks without requiring per-problem tuning or additional compute.

Core claim

Error diversity within a group of rollouts is a strong predictor of training success in RLVR. Problems that produce varied incorrect answers benefit more from the learning process than those that generate the same failures repeatedly. EDAS shapes the advantage for incorrect responses by amplifying penalties for common errors and attenuating penalties for uncommon ones, thereby encouraging the model to sustain diverse reasoning paths and avoid perseverating on repeated mistakes.

What carries the argument

Error Diversity Advantage Shaping (EDAS), a post-hoc adjustment to the advantage signal for incorrect rollouts that scales penalties according to intra-group error diversity.

If this is right

Consistent improvements when EDAS is added to multiple mainstream RLVR algorithms across different models
Average gain of 6.29 points over DAPO on Qwen3-8B evaluated across seven math benchmarks
Encourages maintenance of diverse reasoning paths by reducing penalties on rare errors and increasing them on repeated ones

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Methods that increase the number of rollouts per prompt may see amplified benefits if they naturally capture higher error diversity
Future RLVR pipelines could benefit from routinely reporting error distribution statistics alongside average accuracy
The approach may extend to other domains using binary verifiable rewards where multiple generations are feasible

Load-bearing premise

The observed correlation between higher intra-group error diversity and larger training gains is causal and can be safely exploited through a simple post-hoc advantage adjustment without introducing instability or needing problem-specific tuning.

What would settle it

A controlled experiment that applies EDAS to the same set of prompts while artificially varying error diversity levels in the rollouts and checks whether the expected performance difference between high-diversity and low-diversity groups disappears or reverses.

read the original abstract

Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the distribution of errors, is largely discarded. We identify this as a missed opportunity: empirical analysis reveals that error diversity within a group is a strong predictor of training success, with problems eliciting diverse wrong answers benefiting substantially more from RLVR than those producing homogeneous failures. Motivated by this observation, we propose Error Diversity Advantage Shaping (EDAS), a lightweight, algorithm-agnostic technique that modulates the advantage signal for incorrect rollouts based on intra-group error diversity. EDAS amplifies penalties for dominant, repeated errors and attenuates penalties for rare, exploratory ones, thereby encouraging the model to maintain diverse reasoning paths and discouraging error perseveration. Crucially, EDAS operates as a simple post-hoc adjustment that can be seamlessly integrated into any RLVR algorithm. We validate EDAS on top of several mainstream RLVR methods across a series of models and seven challenging math benchmarks, demonstrating consistent improvements. Notably, EDAS yields an average improvement of 6.29 points over DAPO on Qwen3-8B across seven benchmarks, confirming that exploiting the latent information in group rollouts is a broadly effective strategy for strengthening RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EDAS is a simple post-hoc tweak that uses intra-group error diversity to shape advantages in RLVR and reports clear gains on math benchmarks, but the lack of normalization details leaves open whether the gradient stays unbiased.

read the letter

Hi, the paper's main takeaway is that error diversity within a group rollout is a useful signal for RLVR on reasoning tasks. They turn that observation into EDAS, which modulates the advantage for wrong answers so that repeated errors get stronger penalties and rare ones get weaker ones. This is presented as a lightweight add-on that works on top of existing methods like DAPO. On Qwen3-8B it gives an average 6.29 point lift across seven math benchmarks, which is the kind of practical improvement people actually try to ship. The idea itself feels new in this subfield; most prior work treats the group outputs as independent samples and discards the distribution of mistakes. Here they explicitly keep that structure and use it to discourage error perseveration while keeping exploratory paths alive. That motivation is grounded in their empirical analysis, and the fact that it is algorithm-agnostic is a real plus for adoption. The experiments cover multiple base methods and models, which helps show the effect is not tied to one setup. Still, the description stays at the level of the abstract: no equations for the modulation function, no mention of whether the adjustment is zero-mean normalized per group, and no error bars or ablations on the diversity strength hyperparameter. The stress-test concern about possible bias in the policy gradient therefore lands as a real open question rather than a minor detail. If the modulation shifts the expected advantage away from the original binary-reward baseline, the claimed improvements could partly reflect that shift instead of pure diversity exploitation. Without those checks it is hard to know how robust the result is across different error-homogeneity regimes. This work is aimed at people already running group-based RLVR for LLM reasoning. A reader who wants a quick lever to try on top of DAPO or GRPO would get immediate value from the reported numbers. It is coherent on its own terms and shows honest engagement with the practical side of the problem, so it deserves a serious referee who can ask for the missing normalization proof and extra controls. I would send it to review with those specific requests rather than desk-reject it.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that error diversity within groups of rollouts is a strong predictor of RLVR training success, with diverse-error problems benefiting more than those with homogeneous failures. It proposes Error Diversity Advantage Shaping (EDAS), a lightweight post-hoc adjustment that modulates advantages for incorrect rollouts by amplifying penalties on dominant/repeated errors and attenuating them on rare/exploratory ones, thereby encouraging diverse reasoning paths. EDAS is presented as algorithm-agnostic and integrable into any RLVR method; empirical results show consistent gains, including a 6.29-point average improvement over DAPO on Qwen3-8B across seven math benchmarks.

Significance. If the central empirical claim holds and the modulation preserves unbiased gradients, EDAS would provide a simple, low-overhead way to exploit group structure in RLVR without altering core algorithms or requiring per-problem tuning. This could strengthen reasoning performance in verifiable-reward settings, but the absence of theoretical invariance guarantees and limited ablation details limit the assessed impact.

major comments (2)

[EDAS description] EDAS description (motivation and method sections): the post-hoc modulation of advantages for incorrect rollouts is defined using intra-group error diversity, but the manuscript does not state whether the modulation factor is explicitly zero-mean normalized (or otherwise centered) per group. Without this, the expected advantage deviates from the binary-reward baseline, breaking equivalence to standard advantage estimation and risking a systematic shift in the policy gradient direction, especially on problems with varying error homogeneity.
[Empirical results] Empirical results section: the reported 6.29-point average improvement over DAPO on Qwen3-8B is presented without error bars, standard deviations across runs, or statistical significance tests. This makes it impossible to assess whether the gains are robust or could be explained by variance in the underlying RLVR baselines.

minor comments (2)

[Abstract / Motivation] The abstract and motivation claim error diversity is 'a strong predictor' but provide no explicit definition or formula for the diversity metric (e.g., entropy over error types or number of unique wrong answers).
[Experiments] No ablation is shown on the diversity shaping strength hyperparameter itself, leaving open whether the reported gains require per-benchmark tuning or generalize with a fixed value.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have prompted us to strengthen the technical clarity and empirical rigor of the work. We address each major comment below and indicate the revisions made.

read point-by-point responses

Referee: [EDAS description] EDAS description (motivation and method sections): the post-hoc modulation of advantages for incorrect rollouts is defined using intra-group error diversity, but the manuscript does not state whether the modulation factor is explicitly zero-mean normalized (or otherwise centered) per group. Without this, the expected advantage deviates from the binary-reward baseline, breaking equivalence to standard advantage estimation and risking a systematic shift in the policy gradient direction, especially on problems with varying error homogeneity.

Authors: We appreciate this observation, which correctly identifies a potential source of bias not explicitly addressed in the original submission. The initial EDAS formulation modulates advantages for incorrect rollouts based on intra-group error diversity without per-group zero-mean centering, which can indeed cause the group-level expected advantage to deviate from the binary-reward baseline and introduce a systematic shift in the policy gradient. To resolve this while preserving the lightweight, post-hoc, and algorithm-agnostic character of EDAS, we have revised the method to explicitly zero-mean normalize the modulation factors within each group for the incorrect rollouts. This ensures that the sum of modulated advantages for incorrect responses remains zero relative to the original binary advantage, maintaining equivalence to standard advantage estimation and unbiased gradients. The revised manuscript now states this normalization explicitly in the EDAS description and includes a brief note on the resulting invariance property. revision: yes
Referee: [Empirical results] Empirical results section: the reported 6.29-point average improvement over DAPO on Qwen3-8B is presented without error bars, standard deviations across runs, or statistical significance tests. This makes it impossible to assess whether the gains are robust or could be explained by variance in the underlying RLVR baselines.

Authors: We agree that the absence of variability measures and significance testing in the reported results limits assessment of robustness. In the revised manuscript we have added standard deviations computed across three independent training runs for all main results, including the DAPO comparison on Qwen3-8B, and have included error bars in the corresponding tables and figures. We have also performed paired t-tests between the EDAS-augmented runs and the baseline runs, reporting p-values in the results section. These additions confirm that the 6.29-point average improvement is statistically significant and not explained by training variance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observation plus post-hoc method validated externally

full rationale

The paper's chain begins with an empirical observation that intra-group error diversity correlates with RLVR training success on math benchmarks, then introduces EDAS as a simple post-hoc advantage modulation rule that amplifies penalties on repeated errors and attenuates them on rare ones. This modulation is presented as an algorithm-agnostic adjustment without any derivation that reduces the reported 6.29-point average gain (or any other performance number) to a fitted hyperparameter, self-referential definition, or self-citation chain. The improvements are shown through direct experiments on Qwen3-8B and other models across seven held-out benchmarks, making the central claim falsifiable outside the method's own construction. No equations are supplied that equate the modulated advantage to the original binary-reward baseline by algebraic identity, and the motivation section treats the diversity signal as an observed input rather than a quantity defined from the final result.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that error diversity can be measured reliably from binary correctness signals and that a simple diversity-based reweighting of advantages will generalize across models and benchmarks without introducing instability. No new physical entities are postulated.

free parameters (1)

diversity shaping strength
A scalar that controls how strongly common errors are penalized versus rare ones; its value is not reported in the abstract and must be chosen or tuned.

axioms (1)

domain assumption Binary correctness rewards are sufficient to define meaningful error clusters within a rollout group.
Invoked when the method treats repeated wrong answers as the same error type.

pith-pipeline@v0.9.0 · 5792 in / 1451 out tokens · 38919 ms · 2026-05-20T14:48:10.004021+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 12 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Matharena: Evaluating llms on uncontaminated math competitions, february 2025.https://matharena.ai, 8, 2025

Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, february 2025.https://matharena.ai, 8, 2025

work page 2025
[3]

Post-training as reweighting: A stochastic view of reasoning trajectories in language models.arXiv preprint arXiv:2511.07368, 2025

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, and Taiji Suzuki. Post-training as reweighting: A stochastic view of reasoning trajectories in language models.arXiv preprint arXiv:2511.07368, 2025

work page arXiv 2025
[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

arXiv preprint arXiv:2505.09655 , year=

Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, and Abolfazl Razi. Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero-like training of large language models.arXiv preprint arXiv:2505.09655, 2025

work page arXiv 2025
[6]

Reasoning with exploration: An entropy perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30377–30385, 2026

work page 2026
[7]

American invitational mathematics examination-aime 2024, 2024

MAA Codeforces. American invitational mathematics examination-aime 2024, 2024

work page 2024
[8]

Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, and Zhiwu Lu. Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

work page arXiv 2026
[9]

Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

work page arXiv 2025
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Reasoning through exploration: A reinforcement learning framework for robust function calling.arXiv preprint arXiv:2508.05118, 2025

Bingguang Hao, Zengzhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, et al. Reasoning through exploration: A reinforcement learning framework for robust function calling.arXiv preprint arXiv:2508.05118, 2025

work page arXiv 2025
[12]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024
[13]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Setpo: Set-level policy optimization for diversity-preserving llm reasoning.arXiv preprint arXiv:2602.01062, 2026

Chenyi Li, Yuan Zhang, Bo Wang, Guoqing Ma, Wei Tang, Haoyang Huang, and Nan Duan. Setpo: Set-level policy optimization for diversity-preserving llm reasoning.arXiv preprint arXiv:2602.01062, 2026

work page arXiv 2026
[16]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Code-r1: Reproducing r1 for code with reliable rewards

Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards. 2025

work page 2025
[18]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems, 36:21558–21572, 2023. 11

work page 2023
[19]

Diversity-aware training for test-time scaling

Bohan Lyu, Qixin Xu, Zihan Zhu, Yesai Wu, Zhong Zhang, Haotian Chen, Xin Cong, Xiaojiang Liu, Zhiyuan Liu, and Maosong Sun. Diversity-aware training for test-time scaling

work page
[20]

American mathematics competitions - amc.https://maa.org/, 2023

MAA. American mathematics competitions - amc.https://maa.org/, 2023

work page 2023
[21]

American invitational mathematics examination-AIME 2025.https://maa.org/, 2025

MAA. American invitational mathematics examination-AIME 2025.https://maa.org/, 2025

work page 2025
[22]

American invitational mathematics examination-AIME 2026.https://maa.org/, 2026

MAA. American invitational mathematics examination-AIME 2026.https://maa.org/, 2026

work page 2026
[23]

Ngrpo: Negative-enhanced group relative policy optimization

Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, et al. Ngrpo: Negative-enhanced group relative policy optimization.arXiv preprint arXiv:2509.18851, 2025

work page arXiv 2025
[24]

Codeelo: Benchmarking competition-level code generation of llms with human-comparable elo ratings

Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, et al. Codeelo: Benchmarking competition-level code generation of llms with human- comparable elo ratings.arXiv preprint arXiv:2501.01257, 2025

work page arXiv 2025
[25]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

Christian Walder and Deep Karkhanis. Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

work page arXiv 2025
[28]

Mmr-grpo: Accelerating grpo-style training through diversity- aware reward reweighting.arXiv preprint arXiv:2601.09085, 2026

Kangda Wei and Ruihong Huang. Mmr-grpo: Accelerating grpo-style training through diversity-aware reward reweighting.arXiv preprint arXiv:2601.09085, 2026

work page arXiv 2026
[29]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning.arXiv preprint arXiv:2510.10649, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Grouter: Decoupling routing from representation for accelerated moe training.arXiv preprint arXiv:2603.06626, 2026

Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, and Kun Yuan. Grouter: Decoupling routing from representation for accelerated moe training.arXiv preprint arXiv:2603.06626, 2026

work page arXiv 2026
[31]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, LingjunLiu, etal. Dapo: Anopen-sourcellmreinforcementlearningsystematscale, 2025.URL https://arxiv. org/abs/2503.14476, 1:2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Rspo: Risk-seeking policy optimization for pass@ k and max@ k metrics in large language models.arXiv preprint arXiv:2508.01174, 2025

Kaichen Zhang, Shenghao Gao, Yuzhong Hong, Haipeng Sun, Junwei Bao, Hongfei Jiang, Yang Song, Hong Dingqian, and Hui Xiong. Rspo: Risk-seeking policy optimization for pass@ k and max@ k metrics in large language models.arXiv preprint arXiv:2508.01174, 2025

work page arXiv 2025
[34]

Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848,

Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025

work page arXiv 2025
[35]

Reinforced Efficient Reasoning via Semantically Diverse Exploration

Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, et al. Reinforced efficient reasoning via semantically diverse exploration.arXiv preprint arXiv:2601.05053, 2026. 12 Appendix A Proof of Zero-Sum Property in the Diverse Regime Theorem A.1(Zero-Sum Advantage Redistribution).In the healt...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Matharena: Evaluating llms on uncontaminated math competitions, february 2025.https://matharena.ai, 8, 2025

Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, february 2025.https://matharena.ai, 8, 2025

work page 2025

[3] [3]

Post-training as reweighting: A stochastic view of reasoning trajectories in language models.arXiv preprint arXiv:2511.07368, 2025

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, and Taiji Suzuki. Post-training as reweighting: A stochastic view of reasoning trajectories in language models.arXiv preprint arXiv:2511.07368, 2025

work page arXiv 2025

[4] [4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

arXiv preprint arXiv:2505.09655 , year=

Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, and Abolfazl Razi. Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero-like training of large language models.arXiv preprint arXiv:2505.09655, 2025

work page arXiv 2025

[6] [6]

Reasoning with exploration: An entropy perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30377–30385, 2026

work page 2026

[7] [7]

American invitational mathematics examination-aime 2024, 2024

MAA Codeforces. American invitational mathematics examination-aime 2024, 2024

work page 2024

[8] [8]

Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, and Zhiwu Lu. Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

work page arXiv 2026

[9] [9]

Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

work page arXiv 2025

[10] [10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Reasoning through exploration: A reinforcement learning framework for robust function calling.arXiv preprint arXiv:2508.05118, 2025

Bingguang Hao, Zengzhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, et al. Reasoning through exploration: A reinforcement learning framework for robust function calling.arXiv preprint arXiv:2508.05118, 2025

work page arXiv 2025

[12] [12]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024

[13] [13]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Setpo: Set-level policy optimization for diversity-preserving llm reasoning.arXiv preprint arXiv:2602.01062, 2026

Chenyi Li, Yuan Zhang, Bo Wang, Guoqing Ma, Wei Tang, Haoyang Huang, and Nan Duan. Setpo: Set-level policy optimization for diversity-preserving llm reasoning.arXiv preprint arXiv:2602.01062, 2026

work page arXiv 2026

[16] [16]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Code-r1: Reproducing r1 for code with reliable rewards

Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards. 2025

work page 2025

[18] [18]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems, 36:21558–21572, 2023. 11

work page 2023

[19] [19]

Diversity-aware training for test-time scaling

Bohan Lyu, Qixin Xu, Zihan Zhu, Yesai Wu, Zhong Zhang, Haotian Chen, Xin Cong, Xiaojiang Liu, Zhiyuan Liu, and Maosong Sun. Diversity-aware training for test-time scaling

work page

[20] [20]

American mathematics competitions - amc.https://maa.org/, 2023

MAA. American mathematics competitions - amc.https://maa.org/, 2023

work page 2023

[21] [21]

American invitational mathematics examination-AIME 2025.https://maa.org/, 2025

MAA. American invitational mathematics examination-AIME 2025.https://maa.org/, 2025

work page 2025

[22] [22]

American invitational mathematics examination-AIME 2026.https://maa.org/, 2026

MAA. American invitational mathematics examination-AIME 2026.https://maa.org/, 2026

work page 2026

[23] [23]

Ngrpo: Negative-enhanced group relative policy optimization

Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, et al. Ngrpo: Negative-enhanced group relative policy optimization.arXiv preprint arXiv:2509.18851, 2025

work page arXiv 2025

[24] [24]

Codeelo: Benchmarking competition-level code generation of llms with human-comparable elo ratings

Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, et al. Codeelo: Benchmarking competition-level code generation of llms with human- comparable elo ratings.arXiv preprint arXiv:2501.01257, 2025

work page arXiv 2025

[25] [25]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[26] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

Christian Walder and Deep Karkhanis. Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

work page arXiv 2025

[28] [28]

Mmr-grpo: Accelerating grpo-style training through diversity- aware reward reweighting.arXiv preprint arXiv:2601.09085, 2026

Kangda Wei and Ruihong Huang. Mmr-grpo: Accelerating grpo-style training through diversity-aware reward reweighting.arXiv preprint arXiv:2601.09085, 2026

work page arXiv 2026

[29] [29]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning.arXiv preprint arXiv:2510.10649, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Grouter: Decoupling routing from representation for accelerated moe training.arXiv preprint arXiv:2603.06626, 2026

Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, and Kun Yuan. Grouter: Decoupling routing from representation for accelerated moe training.arXiv preprint arXiv:2603.06626, 2026

work page arXiv 2026

[31] [31]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, LingjunLiu, etal. Dapo: Anopen-sourcellmreinforcementlearningsystematscale, 2025.URL https://arxiv. org/abs/2503.14476, 1:2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Rspo: Risk-seeking policy optimization for pass@ k and max@ k metrics in large language models.arXiv preprint arXiv:2508.01174, 2025

Kaichen Zhang, Shenghao Gao, Yuzhong Hong, Haipeng Sun, Junwei Bao, Hongfei Jiang, Yang Song, Hong Dingqian, and Hui Xiong. Rspo: Risk-seeking policy optimization for pass@ k and max@ k metrics in large language models.arXiv preprint arXiv:2508.01174, 2025

work page arXiv 2025

[34] [34]

Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848,

Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025

work page arXiv 2025

[35] [35]

Reinforced Efficient Reasoning via Semantically Diverse Exploration

Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, et al. Reinforced efficient reasoning via semantically diverse exploration.arXiv preprint arXiv:2601.05053, 2026. 12 Appendix A Proof of Zero-Sum Property in the Diverse Regime Theorem A.1(Zero-Sum Advantage Redistribution).In the healt...

work page internal anchor Pith review Pith/arXiv arXiv 2026