pith. machine review for the scientific record.

arxiv: 2605.04960 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Recognition: unknown

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords GRPO · RLVR · LLM reasoning · entropy modulation · policy optimization · credit assignment · mathematical reasoning

The pith

EP-GRPO fixes GRPO credit assignment failures in LLM reasoning by mining the model's own entropy and policy divergence for dense self-supervised signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies three credit assignment failures in standard GRPO for reinforcement learning with verifiable rewards: uniform token granularity that overlooks variation in token importance, uniform polarity that rewards incorrect steps and penalizes correct ones, and zero-variance collapse that erases useful gradients. EP-GRPO counters these by using entropy to gate focus onto high-information pivots, deriving token-level direction from policy divergence anchored to outcome advantages, and applying cumulative entropy to normalize advantages according to progress. This produces dense guidance without external reward models. Experiments on mathematical reasoning benchmarks show gains in both accuracy and training efficiency.

Core claim

EP-GRPO integrates entropy-gated modulation to prioritize high-entropy decision pivots, implicit process signals derived from policy divergence and anchored to outcome advantages for directional token-level feedback, and cumulative entropy mapping that enables progress-aligned advantage normalization while naturally maintaining gradient flow under zero reward variance.
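
The claim is concrete enough to sketch. Below is a minimal PyTorch rendering of how the three mechanisms could compose; the gating and normalization forms, and every name in it, are illustrative assumptions inferred from the description above, not the paper's implementation.

    import torch

    def ep_grpo_token_signal(logits, ref_logits, rewards, mask):
        """Compose dense token-level signals from sparse outcome rewards.

        logits, ref_logits: (G, T, V) current-policy / frozen-reference logits
        rewards:            (G,) verifiable rewards for a group of G rollouts
        mask:               (G, T) float mask, 1 for generated tokens
        """
        log_p = torch.log_softmax(logits, dim=-1)
        p = log_p.exp()

        # (1) Entropy-gated modulation: upweight high-entropy decision pivots.
        H = -(p * log_p).sum(dim=-1) * mask                      # (G, T)
        gate = H / H.amax(dim=-1, keepdim=True).clamp_min(1e-8)

        # (2) Implicit process signal: per-token KL divergence from the
        # reference policy, signed by the group-relative outcome advantage.
        log_q = torch.log_softmax(ref_logits, dim=-1)
        D = (p * (log_p - log_q)).sum(dim=-1)                    # (G, T)
        A = (rewards - rewards.mean()) / rewards.std().clamp_min(1e-8)

        # (3) Cumulative entropy as a progress coordinate. The paper credits
        # this mapping with preserving gradient flow under zero reward
        # variance; its exact form is not given here, so this sketch only
        # shows the progress-alignment idea.
        cum = torch.cumsum(H, dim=-1)
        progress = cum / cum[:, -1:].clamp_min(1e-8)             # in [0, 1]

        return gate * A.unsqueeze(-1) * D * progress * mask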

What carries the argument

Entropy-gated modulation integrated with policy-divergence implicit process signals and cumulative entropy mapping for progress-aligned advantage normalization.

Load-bearing premise

That entropy and policy divergence mined from the model itself supply reliable, unbiased dense guidance that improves outcomes without introducing instability or new failure modes.

What would settle it

Training runs on the same mathematical reasoning benchmarks where EP-GRPO shows no accuracy or efficiency gain over GRPO, or where it produces higher training instability.

Figures

Figures reproduced from arXiv: 2605.04960 by Li Li, Song Yu, Wenwen Zhao, Zhisheng Yang.

Figure 1: Conceptual illustration of the fundamental limitations in standard GRPO. The top panel demonstrates Uniform Granularity, where the model fails to distinguish between critical high entropy decision pivots and deterministic low entropy derivations. The middle panel shows Uniform Polarity, where sequence-level rewards lead to the indiscriminate reinforcement or penalization of both correct and incorrect inter…

Figure 2: Empirical analysis using standard GRPO. (a) Perturbing high entropy tokens degrades sequence accuracy substantially more than perturbing low entropy tokens, highlighting their disproportionate importance. (b) Incorrect tokens in otherwise correct sequences frequently receive positive advantages, while correct tokens in failed sequences receive negative ones, indicating poor credit assignment. (c) Severe ze…

Figure 3: Overview of EP-GRPO. EP-GRPO transforms sparse outcome rewards into dense, token-level learning signals. It first applies entropy-gated modulation to assign higher weights to important tokens while keeping a consistent direction. Then, an implicit progress signal is computed by comparing the policy with a reference model, providing token-level feedback with varying magnitude and direction, further aligned …

Figure 5: Comparison of training dynamics between EP-GRPO and GRPO.

Figure 6: Training efficiency and output length comparison on Qwen2.5-3B.
Original abstract

Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies three credit assignment failures in Group Relative Policy Optimization (GRPO) for reinforcement learning with verifiable rewards (RLVR) on LLM reasoning: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that misaligns rewards for correct/incorrect steps, and zero-variance collapse that erases outcome-driven gradients. It proposes EP-GRPO, which mines the model's intrinsic signals via entropy-gated modulation to prioritize high-entropy decision points, implicit process guidance from policy divergence anchored to outcome advantages for directional token feedback without external models, and cumulative entropy mapping for progress-aligned advantage normalization that preserves gradients under zero variance. Experiments on mathematical reasoning benchmarks are reported to show superior accuracy and efficiency over GRPO and variants, with code to be released.

Significance. If the empirical results and mechanism validations hold, the work is significant for providing a self-supervised, parameter-light approach to dense credit assignment in RLVR that avoids external reward models. The systematic quantification of GRPO failures and the natural handling of zero-variance cases via cumulative entropy add conceptual value; code release supports reproducibility and allows verification of the claimed efficiency gains.

major comments (2)
  1. [§3] §3 (Method): The anchoring of policy divergence to outcome advantages for directional signals requires the explicit equation or mapping (e.g., how divergence is signed or scaled by advantage) to confirm it supplies unbiased token-level feedback without introducing new variance or bias; this is load-bearing for the 'implicit process guidance' claim.
  2. [Experiments] Experiments section: The reported superiority on math benchmarks must be accompanied by ablations that isolate each of the three mechanisms (entropy gating, divergence signals, cumulative entropy) and show they specifically mitigate the quantified failures (non-uniform informativeness, polarity misalignment, zero-variance waste) rather than producing gains through unrelated effects.
minor comments (2)
  1. [Abstract] Abstract and §2: The quantification of the three failures ('highly non-uniform token informativeness', 'widespread step-level polarity misalignment', 'substantial training waste') should explicitly reference the corresponding figures, tables, or statistics in the main text for immediate verifiability.
  2. [Notation] Notation throughout: Define entropy and divergence symbols once in a preliminary section and ensure consistent usage (e.g., distinguish per-token vs. cumulative entropy) to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The comments identify key areas where the method presentation and experimental validation can be strengthened. We address each major comment below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The anchoring of policy divergence to outcome advantages for directional signals requires the explicit equation or mapping (e.g., how divergence is signed or scaled by advantage) to confirm it supplies unbiased token-level feedback without introducing new variance or bias; this is load-bearing for the 'implicit process guidance' claim.

    Authors: We agree that the anchoring must be stated explicitly to substantiate the claim. In the revised §3.2, we will insert the precise mapping: the implicit process signal at token t is defined as s_t = A * D_t, where A denotes the outcome advantage (+1 or -1 based on the verifiable reward) and D_t is the per-token policy divergence (KL divergence to a fixed reference policy). This multiplication by A supplies the directional polarity, while the divergence itself is computed only on high-entropy tokens after gating. We will add a short unbiasedness argument showing that E[s_t] aligns with the GRPO gradient expectation and that no extraneous variance is introduced beyond the bounded divergence term; a toy sketch of this mapping appears after these responses. revision: yes

  2. Referee: [Experiments] Experiments section: The reported superiority on math benchmarks must be accompanied by ablations that isolate each of the three mechanisms (entropy gating, divergence signals, cumulative entropy) and show they specifically mitigate the quantified failures (non-uniform informativeness, polarity misalignment, zero-variance waste) rather than producing gains through unrelated effects.

    Authors: We acknowledge that the current experiments primarily report end-to-end gains. In the revised manuscript we will add a dedicated ablation subsection that disables each component in turn (EP-GRPO minus entropy gating, minus divergence signals, minus cumulative entropy mapping) and measures the resulting change in the three failure metrics already quantified in §2: token informativeness variance, step-level polarity misalignment rate, and gradient norm under zero-reward batches. These controlled comparisons will directly link each mechanism to the mitigation of the corresponding failure rather than to unrelated factors; illustrative sketches of these metrics follow the responses. revision: yes
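
The mapping stated in response 1 can be rendered directly. A toy version follows; since the rebuttal itself is simulated, treat this only as a sketch of the stated definitions, with the entropy gate and tensor names assumed.

    import torch

    def implicit_process_signal(log_p, log_q, reward, entropy_gate):
        """Per-token directional signal s_t = A * D_t for one rollout.

        log_p, log_q:  (T, V) log-probs of the current policy and a fixed
                       reference policy
        reward:        verifiable reward in {0, 1}
        entropy_gate:  (T,) boolean mask selecting high-entropy tokens
        """
        A = 1.0 if reward > 0 else -1.0         # outcome advantage sign
        p = log_p.exp()
        D = (p * (log_p - log_q)).sum(dim=-1)   # per-token KL(policy || ref)
        s = A * D                               # signed, bounded by the KL term
        return torch.where(entropy_gate, s, torch.zeros_like(s))

The three failure metrics proposed in response 2 can likewise be sketched under assumed definitions; the paper's exact estimators are not reproduced in the text above.

    import torch

    def failure_metrics(entropy, token_adv, step_correct, rewards):
        """entropy, token_adv: (G, T) floats; step_correct: (G, T) bool
        step-level correctness labels; rewards: (B, G), one group of G
        rollouts per prompt.
        """
        # (a) Non-uniform informativeness: dispersion of per-token entropy.
        informativeness_var = entropy.var().item()
        # (b) Polarity misalignment: advantage sign contradicts the step label.
        target = torch.where(step_correct, torch.ones_like(token_adv),
                             -torch.ones_like(token_adv))
        misalignment_rate = (token_adv.sign() != target).float().mean().item()
        # (c) Zero-variance waste: fraction of prompts whose reward group has
        # no variance, so vanilla GRPO contributes no gradient.
        zero_var_frac = rewards.var(dim=-1).eq(0).float().mean().item()
        return informativeness_var, misalignment_rate, zero_var_frac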

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The abstract and high-level description present EP-GRPO as a self-supervised extension of GRPO that mines intrinsic entropy and policy divergence signals to address credit-assignment issues. No equations, derivations, or fitted-parameter renamings are shown. The three proposed mechanisms (entropy gating, divergence-based directional signals, cumulative entropy normalization) are described as direct responses to externally quantified failures rather than reductions to the paper's own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the supplied text. The central empirical claim rests on benchmark results, which are independent of any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that intrinsic entropy and policy divergence constitute useful self-supervised signals; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: The model's intrinsic information flow (entropy and policy divergence) supplies reliable dense guidance for token-level credit assignment without external reward models.
    This is the core premise enabling the three proposed mechanisms.

pith-pipeline@v0.9.0 · 5500 in / 1158 out tokens · 31399 ms · 2026-05-08T17:29:59.270950+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020

  2. [2]

    Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization, 2025

    Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization, 2025

  3. [3]

    Reasoning with exploration: An entropy perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30377–30385, 2026

  4. [4]

    Process reinforcement through implicit rewards, 2025

    Ganqu Cui et al. Process reinforcement through implicit rewards, 2025

  5. [5]

    Prpo: Aligning process reward with outcome reward in policy optimization, 2026

    Rui Ding, Yifan Lv, Xiang Meng, Jiaming Song, Chi Wang, Chengrui Jiang, and Yong Cheng. Prpo: Aligning process reward with outcome reward in policy optimization, 2026

  6. [6]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junmei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  7. [7]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025

  8. [8]

    Measuring mathematical problem solving with the math dataset. Advances in Neural Information Processing Systems, 2021

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Advances in Neural Information Processing Systems, 2021

  9. [9]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

  10. [10]

    Rethinking entropy regularization in large reasoning models, 2025

    Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng, and Jing Shao. Rethinking entropy regularization in large reasoning models, 2025

  11. [11]

    Language models (mostly) know what they know, 2022

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know, 2022

  12. [12]

    Beyond uniform credit: Causal credit assignment for policy optimization, 2026

    Mykola Khandoga, Rui Yuan, and Vinay Kumar Sankarapu. Beyond uniform credit: Causal credit assignment for policy optimization, 2026

  13. [13]

    Tulu 3: Pushing frontiers in open language model post-training, 2024

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training, 2024. COLM 2024

  14. [14]

    No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping

    Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, and Eunho Yang. No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping,

  15. [15]

    Also accepted at ICLR 2026

  16. [16]

    Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  17. [17]

    Step-grpo: Enhancing reasoning quality and efficiency via structured prm-based reinforcement learning

    Wen Li, Jinpeng Wang, Liang-Chi Yu, and Xiaozhen Zhang. Step-grpo: Enhancing reasoning quality and efficiency via structured prm-based reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

  18. [18]

    Let’s verify step by step, 2023

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023

  19. [19]

    Token-level policy optimization: Linking group-level rewards to token-level aggregation via markov likelihood, 2025

    Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, and Zhonghou Lv. Token-level policy optimization: Linking group-level rewards to token-level aggregation via markov likelihood, 2025

  20. [20]

    Re-grpo: Leveraging hard negative cases through large language model guided self training. Neurocomputing, page 132543, 2025

    Haoyu Liu and Le Xiao. Re-grpo: Leveraging hard negative cases through large language model guided self training. Neurocomputing, page 132543, 2025

  21. [21]

    From uniform to heterogeneous: Tailoring policy optimization to every token’s nature, 2025

    Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, and Wentao Zhang. From uniform to heterogeneous: Tailoring policy optimization to every token’s nature, 2025

  22. [22]

    Noise-corrected grpo: From noisy rewards to unbiased gradients, 2025

    Omar El Mansouri, Mohamed El Amine Seddik, and Salem Lahlou. Noise-corrected grpo: From noisy rewards to unbiased gradients, 2025

  23. [23]

    Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  24. [24]

    Qwen2.5 technical report, 2025

    Qwen Team. Qwen2.5 technical report, 2025

  25. [25]

    Multi-task grpo: Reliable llm reasoning across tasks, 2026

    Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, and Ilija Bogunovic. Multi-task grpo: Reliable llm reasoning across tasks, 2026

  26. [26]

    Approximating kl divergence

    John Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html, March 2020

  27. [27]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  28. [28]

    Rewarding progress: Scaling automated process verifiers for llm reasoning, 2024

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning, 2024

  29. [29]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  30. [30]

    A long way to go: Investigating length correlations in rlhf, 2023

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf, 2023

  31. [31]

    Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy, 2025

    Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, and Haihua Yang. Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy, 2025

  32. [32]

    Amc 2023 dataset

    Math-AI Team. Amc 2023 dataset. https://huggingface.co/datasets/math-ai/amc23, 2025. Problems from 2023 American Mathematics Competitions (AMC 12A/12B)

  33. [33]

    TRL: Transformers Reinforcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020

  34. [34]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024

    Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024

  35. [35]

    Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  36. [36]

    Dapo: An open-source llm reinforcement learning system at scale,

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale,

  37. [37]

    Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity, 2025

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity, 2025

  38. [38]

    American invitational mathematics examination (aime) 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024. https://huggingface.co/datasets/math-ai/aime24, 2024

  39. [39]

    American invitational mathematics examination (aime) 2025

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025. https://huggingface.co/datasets/math-ai/aime25, 2025

  40. [40]

    Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts

    Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Also known as GRESO (GRPO with Efficient Selective Rollout)