The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, Ali Jadbabaie · 2025 · arXiv 2509.07430

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

cs.LG · 2026-03-03 · unverdicted · novelty 8.0 · 2 refs

Because of second-moment normalization, Adam achieves a δ^{-1/2} high-probability convergence rate, whereas SGD requires at least δ^{-1}; the result is established via stopping-time/martingale analysis under bounded variance.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

SAGE reshapes the reverse-KL anchor via a guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.

Binary Rewards and Reinforcement Learning: Fundamental Challenges

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model, but misspecification instead drives collapse to concentrated valid outputs.

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

cs.LG · 2026-03-05 · conditional · novelty 6.0

CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

cs.AI · 2025-10-12 · unverdicted · novelty 6.0

UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.

citing papers explorer

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
cs.LG · 2026-03-03 · unverdicted · none · ref 7 · 2 links

Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG · 2026-05-12 · unverdicted · none · ref 32 · 2 links

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
cs.LG · 2026-05-15 · unverdicted · none · ref 34

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
cs.LG · 2026-05-07 · unverdicted · none · ref 73 · 2 links

Binary Rewards and Reinforcement Learning: Fundamental Challenges
cs.LG · 2026-05-04 · unverdicted · none · ref 14

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
cs.LG · 2026-04-15 · unverdicted · none · ref 28

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
cs.LG · 2026-03-05 · conditional · none · ref 14

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
cs.AI · 2025-10-12 · unverdicted · none · ref 9
