Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder; Deep Karkhanis

arXiv: 2505.15201 · v5 · pith:TIWRX7BTnew · submitted 2025-05-21 · cs.LG · cs.AI · cs.CL · stat.ML

keywords: pass, optimization, reward, samples, harder, learning, performance, sets

Abstract

Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. It under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for any k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
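
To make the quantity being optimized concrete, here is a minimal sketch of the widely used unbiased pass@k estimate from n samples. It covers the binary case (c of the n attempts are correct) and a continuous-reward analogue (the mean, over all size-k subsets of the n samples, of the best reward in the subset). This is an illustration only, not the paper's reward transformation or its gradient estimator. The function names are hypothetical, and the continuous case assumes that pass@k is read as the expected maximum reward among k samples.

```python
from itertools import combinations
from math import comb


def pass_at_k_binary(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts of which c are correct.

    Equals the fraction of size-k subsets that contain at least one
    correct attempt: 1 - C(n - c, k) / C(n, k).
    """
    if not (1 <= k <= n and 0 <= c <= n):
        raise ValueError("need 1 <= k <= n and 0 <= c <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def max_at_k_continuous(rewards: list[float], k: int) -> float:
    """Mean of the maximum reward over all size-k subsets (assumed analogue).

    With rewards sorted ascending, the element at 0-based position i is the
    maximum of exactly C(i, k - 1) subsets, so no enumeration is needed.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    weighted = sum(comb(i, k - 1) * ordered[i] for i in range(k - 1, n))
    return weighted / comb(n, k)


def max_at_k_brute_force(rewards: list[float], k: int) -> float:
    """Reference implementation: enumerate every size-k subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    # 4 attempts, 1 correct, k = 2: half of the 6 pairs contain the correct one.
    print(pass_at_k_binary(n=4, c=1, k=2))
    rewards = [0.1, 0.5, 0.9, 0.3]
    print(max_at_k_continuous(rewards, k=2), max_at_k_brute_force(rewards, k=2))
```

With 0/1 rewards the continuous formula coincides with the binary one, and with k = 1 both reduce to the plain mean reward, which is the pass@1 case the abstract contrasts against.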

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation
cs.LG · 2026-06 · unverdicted · novelty 7.0

AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.

Residual Skill Optimization for Text-to-SQL Ensembles
cs.CL · 2026-05 · unverdicted · novelty 7.0

Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.

Finite-Time Regret Analysis of Retry-Aware Bandits
cs.LG · 2026-05 · unverdicted · novelty 7.0

ReMax achieves the first sublinear regret bound for Gaussian rewards at M=2 by characterizing the optimal sampling distribution via an expected-improvement balance condition and separating saturation from underestimat...

Finite-Time Regret Analysis of Retry-Aware Bandits
cs.LG · 2026-05 · unverdicted · novelty 7.0

ReMax achieves the first sublinear finite-time regret bound for Gaussian bandits with M=2 by deriving an expected-improvement balance condition for its optimal sampling distribution and separating saturation from unde...

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
cs.AI · 2026-05 · unverdicted · novelty 7.0

GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
cs.AI · 2026-05 · unverdicted · novelty 7.0

GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 7.0

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

REVES: REvision and VErification--Augmented Training for Test-Time Scaling
cs.LG · 2026-06 · unverdicted · novelty 6.0

REVES augments LLM post-training by decoupling revision and verification signals from successful multi-step trajectories, reporting +6.5 point gains on LiveCodeBench over RL baselines.

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 6.0

EDAS modulates advantage signals in RLVR to penalize repeated errors more and rare errors less, yielding consistent gains on math benchmarks when added to existing methods.

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
cs.LG · 2026-05 · unverdicted · novelty 6.0

SAGE reshapes the reverse-KL anchor via guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.

What should post-training optimize? A test-time scaling law perspective
cs.LG · 2026-05 · unverdicted · novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 6.0

ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
cs.LG · 2026-06 · unverdicted · novelty 5.0

SFT depth increases pre-RL pass@1 but can cause entropy collapse that inverts GRPO outcomes on Qwen models via reduced group advantage variance.

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 5.0

EDAS modulates RL advantage signals for incorrect rollouts by amplifying penalties on repeated errors and attenuating them on rare ones, yielding average gains of 6.29 points over DAPO on Qwen3-8B across seven math be...

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
cs.LG · 2026-05 · unverdicted · novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...