pith. machine review for the scientific record.

arxiv: 2605.00365 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.CL · stat.ML

Recognition: unknown

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords RLVR · policy optimization · diversity collapse · uniform policy · mathematical reasoning · Pass@K · GRPO

The pith

RLVR objectives ignore how probability spreads among correct answers, causing models to suppress valid alternatives over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that standard reinforcement learning with verifiable rewards treats every correct solution identically in its objective, paying no attention to whether probability mass is spread evenly or concentrated on just a few. This structural indifference interacts with the stochastic updates during training to create a self-reinforcing loop that narrows the set of outputs the model produces. The authors characterize the ideal policy under robustness and entropy-regularized criteria and show that only the uniform distribution over correct solutions satisfies both. They therefore add a penalty term that supplies gradient signal to underrepresented correct responses, pushing the learned policy toward that uniform allocation. If the analysis holds, training can recover multi-sample coverage on reasoning tasks while keeping single-sample accuracy intact.
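As a concrete illustration of that indifference, here is a minimal sketch of GRPO-style group-relative advantages (not the paper's code; the toy reward vector is invented): every verified-correct completion in a sampled group receives the identical advantage, so the objective cannot tell a policy that concentrates on one correct solution apart from one that spreads mass over many.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: each sampled completion's reward
    is normalized by the group's mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of 8 samples for one prompt; 1 = verified correct, 0 = incorrect.
# Whether the five correct samples are five copies of one solution or five
# distinct solutions changes nothing below: they all get the same advantage.
rewards = [1, 1, 1, 1, 1, 0, 0, 0]
print(grpo_advantages(rewards))
# -> five identical positive values, three identical negative values:
#    the objective sees only correctness, never diversity within the correct set.
```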

Core claim

Common RLVR objectives such as GRPO are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse in which probability mass concentrates on a narrow subset of correct outputs. The Uniform-Correct Policy, which allocates mass uniformly across all correct solutions, is uniquely optimal under robustness and entropy-regularized optimality criteria. UCPO modifies GRPO by adding a conditional uniformity penalty that redistributes gradient signal toward underrepresented correct responses.

What carries the argument

The conditional uniformity penalty on the policy's distribution over correct solutions, which supplies additional gradient to increase probability on underrepresented valid answers.
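The paper defines this penalty precisely in its method section; as a rough, hypothetical illustration only, one form such a term could take is a KL divergence between the policy's renormalized distribution over the correct completions sampled in a group and the uniform distribution over them, added to the GRPO loss with an interpolation strength τ (the symbol used in the paper's figures). The function names and the additive composition below are assumptions, not the authors' implementation.

```python
import numpy as np

def conditional_uniformity_penalty(logp_correct):
    """Illustrative penalty: KL( p(. | correct) || Uniform ), computed over the
    distinct correct completions observed in the group (logp_correct: shape [m])."""
    logp_correct = np.asarray(logp_correct, dtype=float)
    m = len(logp_correct)
    z = np.exp(logp_correct - logp_correct.max())
    p_given_correct = z / z.sum()          # renormalize within the correct set
    return float(np.sum(p_given_correct * (np.log(p_given_correct + 1e-12) + np.log(m))))

def ucpo_style_loss(grpo_loss, logp_correct, tau=0.1):
    """Hypothetical composition: interpolate the base GRPO loss with the
    uniformity penalty via a strength tau."""
    return grpo_loss + tau * conditional_uniformity_penalty(logp_correct)
```

The KL term is zero only when mass is spread evenly over the m correct answers, so minimizing it supplies exactly the gradient described above: probability is pushed onto underrepresented correct responses.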

If this is right

  • Pass@K scores increase on mathematical reasoning benchmarks while Pass@1 remains competitive (the standard Pass@K estimator is sketched after this list).
  • Equation-level diversity within the set of correct solutions rises by up to 45 percent.
  • The improvement appears consistently across 1.5B to 7B parameter models.
  • Up to 10 percent absolute gain occurs on AIME24 at Pass@64.
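For reference on the headline metric, the snippet below is the standard unbiased Pass@K estimator from the code-generation evaluation literature (it is not specific to this paper): sample n completions per problem, count the c that verify as correct, and estimate the probability that at least one of K draws would be correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: 1 - C(n - c, k) / C(n, k),
    given n sampled completions of which c were verified correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 64 samples with 6 correct: single-attempt vs. 64-attempt coverage
print(pass_at_k(64, 6, 1), pass_at_k(64, 6, 64))   # ~0.094 and 1.0
```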

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same indifference to correct-set diversity may operate in other domains where multiple valid outputs exist, such as code generation or planning.
  • Explicit uniformity terms could be combined with existing entropy bonuses to control exploration more precisely.
  • The collapse mechanism suggests that future verifiable-reward objectives should include explicit terms over the success set rather than treating all successes as equivalent.

Load-bearing premise

The added uniformity penalty can be tuned to redistribute probability among correct answers without introducing instability or degrading the primary reward signal.

What would settle it

Training the same models with the proposed penalty on AIME24 and observing no gain in Pass@64 or in the measured diversity of correct equations relative to the unmodified baseline.

Figures

Figures reproduced from arXiv: 2605.00365 by Anamika Lochab, Bolian Li, Ruqi Zhang.

Figure 1: Controlled verification of the GRPO collapse mechanism in the toy RLVR environment. (A) …
Figure 2: Collapse dynamics under GRPO in a controlled RLVR environment with three correct …
Figure 3: Training dynamics under GRPO, UCPO, and global entropy regularization. GRPO drives …
Figure 4: UCPO reweights gradient mass within the correct set. Starting from an imbalanced …
Figure 5: Effect of global entropy regularization strength in the controlled environment. (A) As the …
Figure 6: Effect of UCPO interpolation strength τ in the controlled environment. (A) Increasing τ progressively maintains diversity within the correct set, preventing concentration onto a single mode. (B) Mass on incorrect tokens remains negligible across all τ values, confirming that UCPO operates exclusively within the correct set. (C) Correctness mass Zθ stays near 1.0 for all τ, demonstrating that UCPO achieves …
Figure 7: Pass@k curves across mathematical reasoning benchmarks and models. UCPO consistently …
Figure 8: Wall-clock time per training step for GRPO and UCPO under identical configuration …
Figure 9: Pass@K comparison between the base model and GRPO on Qwen-2.5-7B. While GRPO …
read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript identifies that common RLVR objectives such as GRPO are structurally indifferent to how probability mass is distributed among correct solutions. It argues that this indifference, combined with stochastic sampling, causes self-reinforcing collapse to a narrow set of correct outputs. The authors characterize the Uniform-Correct Policy as optimal under robustness and entropy-regularized criteria, and introduce UCPO, which augments GRPO with a conditional uniformity penalty to promote even distribution within the correct set. Empirical evaluations on three models and five math benchmarks demonstrate gains in Pass@K (up to +10% on AIME24 at Pass@64) and diversity metrics while preserving Pass@1 performance.

Significance. If the proposed mechanism and the effectiveness of the uniformity penalty hold, this work provides a targeted solution to the diversity collapse issue in RLVR for reasoning tasks, potentially enabling better multi-sample performance without compromising single-attempt accuracy. The conceptual contribution of identifying the Uniform-Correct Policy as uniquely optimal under the stated criteria adds theoretical insight. The availability of code at the provided GitHub link supports reproducibility and allows verification of the implementation.

major comments (3)
  1. [Abstract and UCPO formulation] The central practical claim depends on the uniformity penalty being tunable to redistribute mass among correct answers without lowering the expected reward gradient that drives Pass@1 performance. No theoretical bound is given on the allowable penalty strength relative to the RLVR term, and the reported results use a single tuned value per model/benchmark. This assumption is load-bearing for the headline improvements (+10% Pass@64, maintained Pass@1).
  2. [Experiments section] The empirical results report improvements in Pass@K and diversity but provide no ablations on the penalty coefficient, no statistical significance tests for the gains, and no controls isolating the uniformity term's contribution from other training factors. This limits verification of the claimed mechanism.
  3. [Analysis of optimality criteria] The formalization of the collapse mechanism (indifference in GRPO-style objectives plus stochastic dynamics leading to self-reinforcing suppression) and the optimality characterization would benefit from an explicit simplified derivation or bound showing how unsampled correct responses receive no gradient.
minor comments (2)
  1. [Abstract] The abstract states 'up to 45% higher equation-level diversity within the correct set' but does not define the exact diversity metric or how it is computed at the equation level.
  2. [Method] Notation for the conditional uniformity penalty term and its integration into the GRPO objective could be clarified with an explicit equation in the method description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications on the theoretical aspects and committing to specific revisions that will strengthen the empirical validation and derivations in the manuscript.

read point-by-point responses
  1. Referee: [Abstract and UCPO formulation] The central practical claim depends on the uniformity penalty being tunable to redistribute mass among correct answers without lowering the expected reward gradient that drives Pass@1 performance. No theoretical bound is given on the allowable penalty strength relative to the RLVR term, and the reported results use a single tuned value per model/benchmark. This assumption is load-bearing for the headline improvements (+10% Pass@64, maintained Pass@1).

    Authors: We acknowledge that a general theoretical bound on the allowable penalty strength would provide stronger guarantees. Deriving such a bound is non-trivial because the interaction between the uniformity term and the GRPO gradient depends on the evolving sampling distribution over correct responses. In the revised manuscript, we will add a dedicated sensitivity analysis subsection with plots of Pass@1 and Pass@K as functions of the penalty coefficient across multiple models. These results will identify the practical range where Pass@1 is preserved while diversity gains are realized, thereby supporting the tunability claim empirically. The single tuned values were selected via validation-set grid search to avoid Pass@1 degradation. revision: yes

  2. Referee: [Experiments section] The empirical results report improvements in Pass@K and diversity but provide no ablations on the penalty coefficient, no statistical significance tests for the gains, and no controls isolating the uniformity term's contribution from other training factors. This limits verification of the claimed mechanism.

    Authors: We agree that these additions would improve verifiability of the mechanism. In the revised version, we will expand the Experiments section and Appendix to include: (i) full ablations over a grid of penalty coefficients with all metrics reported, (ii) statistical significance tests (bootstrap confidence intervals and paired tests across random seeds) for the Pass@K and diversity improvements, and (iii) an explicit control comparing UCPO to GRPO under matched hyperparameter budgets but without the uniformity penalty. These changes will directly isolate the contribution of the uniformity term. revision: yes

  3. Referee: [Analysis of optimality criteria] The formalization of the collapse mechanism (indifference in GRPO-style objectives plus stochastic dynamics leading to self-reinforcing suppression) and the optimality characterization would benefit from an explicit simplified derivation or bound showing how unsampled correct responses receive no gradient.

    Authors: We appreciate this suggestion for greater clarity. In the revised manuscript, we will insert a new simplified derivation (with a small illustrative example) immediately after the GRPO gradient expression in Section 3. The derivation will explicitly show that, for any correct response not appearing in the sampled group, the advantage-weighted gradient term is identically zero because the advantage is computed solely from the sampled trajectories and the correctness indicator provides no update signal to unsampled alternatives. This step-by-step calculation will be followed by the general case to illustrate the self-reinforcing suppression. revision: yes
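To make the promised derivation tangible, a toy tabular example (purely illustrative, not the authors' code) shows the mechanism: the GRPO-style update sums advantage-weighted score-function terms over the sampled completions only, so a correct answer that never appears in the group receives no direct push of its own while the sampled correct answers do; repeated across updates, that asymmetry is the seed of the self-reinforcing narrowing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular softmax policy over 4 candidate answers; indices 0-2 are correct, 3 is not.
logits = np.array([2.0, 0.5, 0.5, 0.0])
correct = np.array([1.0, 1.0, 1.0, 0.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(logits)
group = rng.choice(4, size=8, p=probs)                 # sampled group of 8 completions
rewards = correct[group]
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Monte Carlo gradient of the group objective w.r.t. the logits:
# a sum over SAMPLED completions of advantage * grad log pi(a).
grad = np.zeros_like(logits)
for a, A in zip(group, adv):
    grad += A * (np.eye(4)[a] - probs)                 # gradient of log-softmax
grad /= len(group)

print(group)  # any correct index absent from this group contributes no term above
print(grad)   # and so receives no positive update of its own this step
```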

Circularity Check

0 steps flagged

No significant circularity; optimality characterization and UCPO proposal are self-contained.

full rationale

The paper's core derivation formalizes GRPO indifference via advantage analysis (all correct solutions receive identical reward=1 signals) and stochastic collapse via unsampled responses receiving zero gradient. It then derives the Uniform-Correct Policy as uniquely optimal under explicitly stated robustness and entropy-regularized criteria without defining those criteria in terms of uniformity itself. The UCPO penalty is motivated by this independent analysis rather than fitted to the same data or smuggled via self-citation. Empirical tuning of the penalty coefficient is presented as a practical choice, not a theoretical prediction that reduces to the input distribution. No load-bearing step equates to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters or invented entities; the central claim rests on standard RL assumptions plus the domain assumption that uniformity within the correct set is desirable and achievable via gradient penalty.

axioms (1)
  • domain assumption RLVR objectives are indifferent to probability distribution among correct solutions
    Stated as the structural cause identified in the paper

pith-pipeline@v0.9.0 · 5571 in / 1180 out tokens · 54674 ms · 2026-05-09T19:51:38.804899+00:00 · methodology

discussion (0)

