Binary Rewards and Reinforcement Learning: Fundamental Challenges
Pith reviewed 2026-05-09 16:23 UTC · model grok-4.3
The pith
Under model misspecification, lowering β in KL-controlled RL with binary rewards concentrates the policy on fewer and fewer valid outputs instead of converging to the filtered model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit β → 0, the filtered model p* defined as the base model conditioned on validity, which is the unique fully valid distribution closest to the base model in KL divergence. Under model misspecification, the pressure to decrease β drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing toward ever fewer as β decreases, rather than toward the filtered model.
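For orientation, the KL-regularized objective behind this selection (written here in the generic form, paraphrasing the abstract's notation rather than quoting the paper) has the tilted distribution as its closed-form maximizer:

```latex
\max_{q}\; \mathbb{E}_{y\sim q}\big[v(y)\big] \;-\; \beta\,\mathrm{KL}\big(q \,\|\, a\big)
\qquad\Longrightarrow\qquad
q^{*}(y) \;=\; p_{[\beta]}(y) \;\propto\; a(y)\, e^{\,v(y)/\beta}.
```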
What carries the argument
The tilted distribution p_{[β]} ∝ a(y) exp(v(y)/β), which arises from the KL-penalized objective and whose limiting behavior under misspecification produces concentration instead of the filtered model.
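As a sanity check on this machinery (a toy sketch, not the paper's experiment; the numbers below are made up), the tilted distribution can be computed directly on a small discrete space:

```python
import numpy as np

# Toy base model a(y) over 5 outputs and a binary validity function v(y).
# These values are illustrative, not taken from the paper.
a = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
v = np.array([0, 1, 1, 0, 1])

def tilted(beta):
    """p_[beta](y) proportional to a(y) * exp(v(y) / beta)."""
    w = a * np.exp(v / beta)
    return w / w.sum()

# Filtered model p_*: the base model conditioned on validity.
p_star = a * v / (a * v).sum()

for beta in (1.0, 0.1, 0.01):
    print(beta, np.round(tilted(beta), 4))

# As beta -> 0, all mass moves onto the valid set, but the ratios among
# valid outputs never change: exp(1/beta) is a common factor there, so
# the global optimum converges to p_* rather than collapsing further.
assert np.allclose(tilted(0.01), p_star)
```

The final assertion is the point the referee report below also makes: at the level of the exact optimum, lowering β raises validity mass without concentrating within the valid set.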
If this is right
- KL-control selects the filtered model p* in the limit β → 0 when the model is correctly specified.
- Explicit formulas relate the hyperparameter β to the more interpretable target validity rate μ.
- The set of reward-maximizing distributions remains infinite without KL control.
- Alternative divergences that directly reward coverage of the support of p* avoid concentration on a small number of outputs.
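A minimal sketch of what a β-to-μ relation looks like in the binary case, derived from the tilted form with base validity mass m = P_a(valid); the paper's exact formulas may differ:

```python
import math

def validity_rate(beta, m):
    """mu(beta): mass that p_[beta] puts on the valid set.

    With binary v, every valid output picks up the same factor
    exp(1/beta), so mu depends on a(.) only through m = P_a(valid).
    """
    t = math.exp(1.0 / beta)
    return m * t / (m * t + (1.0 - m))

def beta_for_target(mu, m):
    """Invert mu(beta) = mu for beta (requires m < mu < 1)."""
    return 1.0 / math.log(mu * (1.0 - m) / (m * (1.0 - mu)))

m = 0.2  # illustrative base validity mass, not from the paper
beta = beta_for_target(0.95, m)
print(beta, validity_rate(beta, m))  # validity_rate recovers 0.95
```

The monotone trade-off is visible directly: pushing μ toward 1 forces β toward 0, which is the regime where the misspecification argument applies.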
Where Pith is reading between the lines
- Monitoring the effective number of distinct valid samples during β annealing could detect the onset of collapse in large-scale training runs.
- Switching to divergences that penalize missing support of p* rather than rewarding high individual validity scores may preserve multi-sample coverage.
- The same concentration dynamic may appear in any RL setting that uses binary verifiable rewards and KL regularization under imperfect models.
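One way to implement such a monitor (a hypothetical helper, not from the paper): track the exponentiated entropy of the empirical distribution over valid samples, which reads as an "effective number" of distinct valid outputs:

```python
import math
from collections import Counter

def effective_valid_count(samples, is_valid):
    """exp(entropy) of the empirical distribution over valid samples.

    Equals N for a uniform distribution over N distinct valid outputs
    and approaches 1 as the policy collapses onto a single output.
    """
    valid = [s for s in samples if is_valid(s)]
    if not valid:
        return 0.0
    counts = Counter(valid)
    n = len(valid)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)

# Illustrative batches: a diverse one versus a collapsed one.
diverse = ["a", "b", "c", "d"] * 25
collapsed = ["a"] * 97 + ["b"] * 3
print(effective_valid_count(diverse, lambda s: True))    # approx 4.0
print(effective_valid_count(collapsed, lambda s: True))  # close to 1
```

A steadily shrinking effective count across β-annealing steps, at a constant measured validity rate, would be the signature of the collapse described above.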
Load-bearing premise
The assumption that real training occurs under model misspecification so that the validity function does not perfectly align with the base model.
What would settle it
In a controlled autoregressive toy model with known base distribution and validity function, measure whether the number of distinct valid outputs sampled from the optimized policy shrinks as β is lowered while holding the target validity rate fixed.
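The measurement side of this experiment can be sketched as a small harness; the `sample_fn` interface is hypothetical, standing in for a trained policy at a fixed β:

```python
import random

def distinct_valid_outputs(sample_fn, is_valid, n_samples, seed=0):
    """Count distinct valid outputs among n_samples policy draws.

    Collapse shows up as this count shrinking as beta is lowered
    while the measured validity rate stays at the target.
    """
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_samples):
        y = sample_fn(rng)
        if is_valid(y):
            seen.add(y)
    return len(seen)

# Stub policies illustrating the readout (not trained models):
broad = lambda rng: rng.randrange(100)    # mass spread over many outputs
collapsed = lambda rng: rng.randrange(2)  # mass collapsed onto two outputs
is_valid = lambda y: y % 2 == 0
print(distinct_valid_outputs(broad, is_valid, 1000))      # near 50
print(distinct_valid_outputs(collapsed, is_valid, 1000))  # 1
```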
Figures
Original abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit $\beta\to 0$, the filtered model $p_*:=a(\cdot\mid\mathcal{Y}_1)$ -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution $p_{[\beta]}\propto a(y)\,e^{v(y)/\beta}$ converges to $p_*$ in forward KL as $\beta\to 0$, yet $p_*$ cannot serve as a direct optimization target because $\mathrm{KL}(q\,\|\,p_*)$ is infinite for any full-support policy $q$. We develop explicit formulas relating the hyperparameter $\beta$ to the more interpretable target validity rate $\mu$. Under model misspecification -- the typical practical regime -- the pressure to decrease $\beta$ drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing toward ever fewer as $\beta$ decreases, rather than toward the filtered model. We illustrate this mechanism on a toy autoregressive experiment and discuss how alternative divergences that target $p_*$ directly -- as pursued empirically by \citet{kruszewski_whatever_2026} -- avoid this failure mode by rewarding coverage of $p_*$'s support rather than concentration on high-validity outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a theoretical analysis of diversity collapse in reinforcement learning with verifiable binary rewards (RLVR). It shows that binary rewards lead to an infinite set of reward-maximizing distributions, resolved by KL-control, which selects the filtered base model p_* in the β → 0 limit via the tilted distribution p_{[β]}. Explicit relations between β and target validity μ are derived, and under model misspecification, the paper argues that decreasing β causes practical optimization to collapse to concentrated distributions over few valid outputs rather than p_*. A toy autoregressive experiment illustrates this, and alternatives using different divergences are discussed.
Significance. This work offers a potential explanation for the common issue of reduced diversity in RLVR-trained models despite improved accuracy. The explicit β-μ formulas and the identification of the forward KL convergence to p_* are valuable contributions. If the mechanism is confirmed, it would support exploring coverage-rewarding objectives as in the cited work. The toy experiment provides initial empirical grounding.
major comments (1)
- [Abstract] Abstract (sentence beginning 'Under model misspecification'): The claim that decreasing β drives the optimizer toward highly concentrated distributions over a small number of valid outputs rather than the filtered model p_* is load-bearing for the central account. However, the explicit tilted form p_{[β]} ∝ a(y) exp(v(y)/β) with binary v(y) ∈ {0,1} yields p_β(y | valid) = a(y | valid) exactly for every β (after normalization over the valid set). Thus the global optimum exhibits no additional concentration within valids as β decreases; any such collapse in the toy experiment or practice must arise from optimization failure to reach p_β (e.g., policy-gradient variance or parameterization limits) rather than a structural property of the objective. This distinction requires clarification or correction to sustain the proposed mechanism under misspecification.
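The invariance the referee invokes follows in one line from the tilted form (restated here in the abstract's notation):

```latex
p_{[\beta]}(y \mid y \in \mathcal{Y}_1)
= \frac{a(y)\, e^{1/\beta}}{\sum_{y' \in \mathcal{Y}_1} a(y')\, e^{1/\beta}}
= \frac{a(y)}{\sum_{y' \in \mathcal{Y}_1} a(y')}
= a(y \mid \mathcal{Y}_1) = p_*(y),
```

with the factor $e^{1/\beta}$ cancelling for every $\beta > 0$.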
minor comments (2)
- [Abstract] The citation 'kruszewski_whatever_2026' appears to be a placeholder and should be replaced with the actual reference.
- Notation for p_{[β]}, p_*, and the validity function v(y) would benefit from earlier introduction and a dedicated notation table or paragraph to aid readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract (sentence beginning 'Under model misspecification'): The claim that decreasing β drives the optimizer toward highly concentrated distributions over a small number of valid outputs rather than the filtered model p_* is load-bearing for the central account. However, the explicit tilted form p_{[β]} ∝ a(y) exp(v(y)/β) with binary v(y) ∈ {0,1} yields p_β(y | valid) = a(y | valid) exactly for every β (after normalization over the valid set). Thus the global optimum exhibits no additional concentration within valids as β decreases; any such collapse in the toy experiment or practice must arise from optimization failure to reach p_β (e.g., policy-gradient variance or parameterization limits) rather than a structural property of the objective. This distinction requires clarification or correction to sustain the proposed mechanism under misspecification.
Authors: We thank the referee for this precise observation, which is correct: the conditional p_β(y | valid) equals a(y | valid) for every β, so the global optimum exhibits no additional concentration as β decreases. Our abstract statement refers specifically to the dynamics of practical optimizers under model misspecification (the typical regime), where the explicit β-μ relations we derive require lowering β to achieve higher validity; this change in the objective makes the optimization landscape progressively harder for policy-gradient methods, increasing gradient variance and causing finite models to lose support over the full valid set, collapsing onto few outputs. The toy autoregressive experiment illustrates exactly this empirical behavior. We will revise the abstract and discussion to explicitly distinguish the unchanging global optimum p_β from the observed optimization failures, clarifying that collapse arises from inability to reach p_β rather than any change in p_β itself. This clarification sustains the central account while improving precision.
revision: yes
Circularity Check
No significant circularity; derivations are self-contained using standard KL properties
Full rationale
The paper derives the tilted distribution p_[β] ∝ a(y) exp(v(y)/β) for binary v, its convergence to p_* as β→0, the infinite KL(q||p_*) issue, and explicit β-μ relations directly from the normalizing partition function and standard exponential tilting identities. These steps are mathematically independent of the target claims and do not reduce any 'prediction' to a fitted quantity or self-referential definition by construction. The misspecification discussion and toy experiment are presented as interpretive consequences rather than load-bearing derivations that collapse to inputs. No self-citation chains or uniqueness theorems from prior author work are invoked to force the central results.
Axiom & Free-Parameter Ledger
free parameters (1)
- β, the KL-penalty coefficient
axioms (2)
- domain assumption: For binary rewards, the set of distributions maximizing expected reward is infinite, with no distinguished element.
- standard math: The tilted distribution p_[β] ∝ a(y) exp(v(y)/β) converges to the filtered model p_* in forward KL as β → 0.
Reference graph
Works this paper leans on
- [1] S.-i. Amari. Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194. Springer, 2016.
- [2] L. D. Brown. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes–Monograph Series, vol. 9. Institute of Mathematical Statistics, Hayward, CA, 1986.
- [3] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
- [4] M. Dymetman. Exponential families from a single KL identity. arXiv preprint arXiv:2604.28036, 2026.
- [5] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [6] D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. In International Conference on Machine Learning (ICML), 2023.
- [7] M. Khalifa, H. Elsahar, and M. Dymetman. A distributional approach to controlled text generation. In 9th International Conference on Learning Representations (ICLR), 2021.
- [8] H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95(20):200201, 2005.
- [9] M. Kim, T. Thonet, J. Rozen, H. Lee, K. Jung, and M. Dymetman. Guaranteed generation from large language models. In International Conference on Learning Representations (ICLR), 2025.
- [10] T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [11] T. Korbak, E. Perez, and C. L. Buckley. RL with KL penalties is better viewed as Bayesian inference. arXiv preprint arXiv:2205.11275, 2022.
- [12] G. Kruszewski, P. Erbacher, J. Rozen, and M. Dymetman. Whatever remains must be true: Filtering drives reasoning in LLMs, shaping diversity. In International Conference on Learning Representations (ICLR), 2026.
- [13]
- [14]
- [15] Y. Liu, Y. Zeng, Y. Yao, Z. Xie, Z. Sun, B. Wang, H. Wang, Y. Wang, and D. Yin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
- [16] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [17] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561. University of California Press, 1961.
- [18] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [20] E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NeurIPS), pages 1369–1376, 2007.
- [21]
- [22] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
- [23] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
- [24] Q. Yu, Z. Liu, J. Peng, S. Zheng, C. Lyu, Y. Cao, H. Huang, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [25] Y. Yue, Z. Chen, A. Lu, Z. Ye, and S. Zheng. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [26] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.