Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

Suman Banerjee; Tong Che; Yilong Li


Coordinated strategy tuples beat independent sampling for code pass@K under a fixed attempt budget.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.5

2026-07-12 15:55 UTC pith:GQY37BAE


arxiv 2605.27000 v3 pith:GQY37BAE submitted 2026-05-26 cs.CL cs.AI


classification cs.CL cs.AI

keywords: pass@K · code generation · reinforcement learning with verifiable rewards · strategy planning · test-time compute · competitive programming · policy optimization

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When you give a code model a fixed number of tries against a verifier, the usual approach draws those tries independently from one answer distribution. On competitive-programming problems that admit several different algorithms, those independent draws often collapse onto near-duplicate reasoning paths and waste the budget. This paper claims that the right object to optimize is a joint planner–solver policy: a planner first emits a coordinated tuple of K high-level methods, each conditioned on the earlier ones, and a shared solver attempts one solution per method. Credit for the planner is given only when the tuple is accepted by a narrow validity gate and at least one branch actually passes the verifier. Under the same K=4 solver-attempt budget, that coordinated policy improves pass@4 over direct sampling, planning prompts, planner-only SFT, and pass@K-oriented RL across APPS, CodeContests, and LiveCodeBench-v6, with the gains carrying over to larger attempt budgets when multiple tuples are pooled.
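The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `plan_next`, `solve`, and `verify` are hypothetical callables standing in for the planner, the shared solver, and the sandbox verifier, and the toy stubs exist only so the sketch runs.

```python
from typing import Callable, List, Sequence, Tuple


def coordinated_rollout(
    problem: str,
    plan_next: Callable[[str, Sequence[str]], str],  # hypothetical planner step
    solve: Callable[[str, str], str],                # hypothetical shared solver
    verify: Callable[[str, str], int],               # hypothetical verifier: 0 or 1
    k: int = 4,
) -> Tuple[List[str], List[str], List[int]]:
    """Emit k strategies one at a time, then make one solver attempt per strategy."""
    strategies: List[str] = []
    for _ in range(k):
        # Each strategy is conditioned on the problem and on all earlier strategies.
        strategies.append(plan_next(problem, tuple(strategies)))
    solutions = [solve(problem, s) for s in strategies]
    outcomes = [verify(problem, y) for y in solutions]
    return strategies, solutions, outcomes


# Toy stand-ins (assumptions for illustration only).
TOY_MENU = ["greedy", "dynamic programming", "binary search", "brute force"]


def toy_plan_next(problem: str, earlier: Sequence[str]) -> str:
    # Pick the first method not already in the tuple, so the plan has no duplicates.
    return next(m for m in TOY_MENU if m not in earlier)


def toy_solve(problem: str, strategy: str) -> str:
    return f"# solution to {problem!r} via {strategy}"


def toy_verify(problem: str, solution: str) -> int:
    # Pretend only the dynamic-programming branch passes the tests.
    return int("dynamic programming" in solution)


strategies, solutions, outcomes = coordinated_rollout(
    "toy problem", toy_plan_next, toy_solve, toy_verify
)
print(strategies)
print(outcomes, "any branch passed:", max(outcomes))
```

The solver budget is exactly `k` attempts per problem, which is what makes the comparison against `k` independent samples a matched-budget one.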

Core claim

Under a fixed K=4 solver-attempt budget and the same verifier, a joint planner–solver policy that emits a coordinated K-tuple of high-level strategies and receives validity-gated pass@K credit improves pass@4 over independent sampling, planning baselines, planner-only SFT, and pass@K-oriented RL on competitive-programming benchmarks, with significant gains on six of nine model–benchmark cells.

What carries the argument

Coordinated Pass@K Policy Optimization (CPPO): an autoregressive strategy-tuple planner q(S|x) whose multiplicative reward R_plan = J_ψ · R_out is nonzero only for valid tuples that yield at least one verifier-confirmed solver success, trained with split-region GRPO advantages for planner and solver tokens.
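A minimal sketch of the multiplicative planner reward, assuming the gate J_ψ is obtained by thresholding a scalar validity score at τ. The ledger further down reports τ = 0.17, but the `>=` comparison and the function names here are assumptions for illustration, not details confirmed by the paper.

```python
from typing import Sequence


def outcome_reward(branch_outcomes: Sequence[int]) -> int:
    """R_out = max_i r_i: 1 if any solver branch passed the verifier, else 0."""
    return max(branch_outcomes)


def planner_reward(
    validity_score: float,
    branch_outcomes: Sequence[int],
    tau: float = 0.17,
) -> int:
    """R_plan = J_psi * R_out, with J_psi taken as a thresholded score (assumption)."""
    gate = 1 if validity_score >= tau else 0
    return gate * outcome_reward(branch_outcomes)


# Valid tuple, one passing branch: the planner is credited.
print(planner_reward(0.90, [0, 1, 0, 0]))
# Rejected tuple: no planner credit even though every branch passed.
print(planner_reward(0.05, [1, 1, 1, 1]))
# Valid tuple, no passing branch: no planner credit.
print(planner_reward(0.90, [0, 0, 0, 0]))
```

Because the two factors are multiplied rather than added, a tuple cannot earn planner credit through validity alone or through a lucky solve on a malformed plan.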

Load-bearing premise

The problems of interest admit multiple genuinely distinct algorithmic strategies, so that allocating attempts across a coordinated strategy tuple can beat independent draws from a single answer distribution.
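A toy calculation, not taken from the paper, makes this premise concrete. Assume hypothetically that a problem has K candidate strategies, that an attempt following strategy j passes with probability p_j independently of the other attempts, and that independent sampling picks strategy j with probability w_j. The symbols p_j, w_j, and \bar{p} are illustrative assumptions.

```latex
\begin{align*}
\text{pass@}K_{\mathrm{iid}} &= 1 - (1 - \bar{p})^{K},
  \qquad \bar{p} = \sum_{j=1}^{K} w_j\, p_j, \\
\text{pass@}K_{\mathrm{coord}} &= 1 - \prod_{j=1}^{K} (1 - p_j).
\end{align*}
% Uniform mixing, w_j = 1/K: by the AM-GM inequality,
\[
\prod_{j=1}^{K} (1 - p_j)
  \;\le\; \Bigl(\tfrac{1}{K} \sum_{j=1}^{K} (1 - p_j)\Bigr)^{K}
  = (1 - \bar{p})^{K}
\quad\Longrightarrow\quad
\text{pass@}K_{\mathrm{coord}} \;\ge\; \text{pass@}K_{\mathrm{iid}}.
\]
% Single canonical path, p = (p, 0, \dots, 0), with a sampler that already
% concentrates on it, w = (1, 0, \dots, 0):
\[
\text{pass@}K_{\mathrm{iid}} = 1 - (1 - p)^{K} \;\ge\; p = \text{pass@}K_{\mathrm{coord}}.
\]
```

Under these assumptions, spreading attempts across strategies helps when the sampler's mass is split over strategies of unequal quality, and it hurts when only one path works and the sampler already concentrates on it. That is the same boundary the premise draws.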

What would settle it

On the same held-out competitive-programming sets, under matched K=4 solver budget and verifier, a coordinated CPPO policy fails to beat the strongest independent-sampling or pass@K RL baseline once training seed and problem resampling are accounted for, or the same pipeline yields no gain once problems are restricted to those with a single canonical solution path.


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Solid fixed-budget pass@K recipe that actually coordinates strategies; fix the abstract number and ship the code.


The real news is a joint planner–solver policy for pass@K: one autoregressive strategy tuple of K=4 methods, a shared solver branch per method, and a multiplicative planner reward R_plan = J_ψ · R_out that only credits valid tuples with at least one verifier pass. That factorization, plus split-region GRPO, is the new piece. Planning prompts, PlanSearch, PKPO, and diversity RL are all already on the table; this paper trains the tuple end-to-end under the pass@K objective.

What they did well is the experimental hygiene. Same K=4 solver budget and sandbox verifier across APPS, CodeContests, and LCBv6; stage ablations (SFT → warm-up → full CPPO); reward-component and M ablations; joint-vs-iid inference; diversity metrics; maj@4 on single-answer subsets so they are not just diluting the mode; Gemma-4 transfer; hierarchical bootstrap over problems and seeds; token-normalized accounting. Directional gains hold in all nine size–benchmark cells, significant on six. The multi-strategy competitive-programming scope is stated up front and in Limitations, not smuggled past the reader.

Soft spots, in proportion: the abstract’s headline gain (0.588→0.748, +0.16) does not match the body/Table 1 (0.728, +0.14) for the same Qwen3.5-9B LCBv6 vs PKPO cell. That is a presentation bug on their own strongest number, not a threat to the fixed-budget claim. Three seeds and no released code yet leave the exact point estimates provisional. The validity gate is a frozen LLM-judge RM; they show it is a validity filter, not a quality scorer, which is the right use. Free parameters (τ, M, K_tuple) are standard for this class of work.

This is for people who already spend test-time compute on repeated sampling for code and want a better allocation of the K attempts. Math with a single canonical path is out of scope by design. I would send it to peer review: the method is clear, the comparisons are fair, and the central pattern is reproducible enough to deserve referee time once the abstract is aligned and artifacts are promised.

Referee Report

3 major / 5 minor

Summary. The paper proposes Coordinated Pass@K Policy Optimization (CPPO), which replaces independent answer sampling for pass@K code generation with a joint planner–solver policy: an autoregressive planner emits a K=4 tuple of high-level algorithmic strategies, and a shared solver produces one attempt per strategy. Planner credit uses a multiplicative reward R_plan = J_ψ · R_out that is nonzero only for validity-gated tuples that also yield at least one verifier-confirmed success; solver tokens receive within-tuple outcome advantages under GRPO. Under a matched K=4 solver-attempt budget, CPPO reports higher pass@4 than direct sampling, planning baselines, planner-only SFT, and pass@K-oriented RL (including PKPO and UpSkill) on APPS, CodeContests, and LiveCodeBench-v6 for Qwen3.5-{2B,4B,9B}, with hierarchical bootstrap significance on six of nine cells, plus supporting ablations, diversity metrics, maj@4 checks, and a Gemma-4 transfer experiment.
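The split-region credit assignment in this summary can be sketched as two separate group normalizations: planner advantages across the M tuples sampled for one prompt, and solver advantages across the K branches inside each tuple. The exact normalization (population standard deviation, the `eps` constant) is an assumption, and the clipping and KL terms of GRPO are omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def group_normalize(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group; a constant group maps to zeros."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def split_region_advantages(
    plan_rewards: Sequence[float],             # R_plan for each of the M tuples
    branch_outcomes: Sequence[Sequence[int]],  # r_i for the K branches of each tuple
) -> Tuple[List[float], List[List[float]]]:
    # Planner tokens: compare tuples against each other for the same prompt.
    planner_adv = group_normalize(plan_rewards)
    # Solver tokens: compare branches against each other inside one tuple.
    solver_adv = [group_normalize(r) for r in branch_outcomes]
    return planner_adv, solver_adv


# Toy rollout with M = 4 tuples and K = 4 branches (the paper's main setting is M = 8).
plan_rewards = [1, 0, 0, 1]
branch_outcomes = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
planner_adv, solver_adv = split_region_advantages(plan_rewards, branch_outcomes)
print(planner_adv)
print(solver_adv[0])
```

A tuple whose branches all fail yields zero solver advantage in this sketch, so its solver tokens contribute no gradient signal, while the planner still receives a negative advantage relative to tuples that succeeded.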

Significance. If the fixed-budget gains hold under independent replication, the work is a clear contribution to test-time compute allocation for code reasoning: it reframes pass@K from independent draws of one answer distribution into coordinated multi-strategy exploration, with a concrete RLVR training recipe. Strengths include multi-benchmark, multi-size evaluation; stage and reward-component ablations (Tables 5–6); joint-vs-iid inference decomposition (Appendix I); algorithmic-diversity measurement with classifier reliability checks (Appendix H); token-normalized accounting; maj@4 consistency on single-answer subtasks; hierarchical bootstrap over problems and seeds; and explicit scope limits for single-path math. The multiplicative validity gate and split-region GRPO design are well motivated and empirically isolated. The result is practically relevant for competitive-programming-style settings where any one correct attempt suffices.
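For readers unfamiliar with the significance procedure named above, here is a sketch of one common two-level bootstrap that resamples training seeds and problems with replacement. The paper's exact procedure may differ; the function name, the shared problem resample across seeds, and the percentile interval are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def hierarchical_bootstrap_ci(
    diffs: Sequence[Sequence[float]],  # diffs[seed][problem] = method minus baseline
    n_boot: int = 2000,
    alpha: float = 0.05,
    rng_seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the mean paired difference, resampling seeds and problems."""
    rng = random.Random(rng_seed)
    n_seeds, n_problems = len(diffs), len(diffs[0])
    stats: List[float] = []
    for _ in range(n_boot):
        # Level 1: resample training seeds with replacement.
        seed_idx = [rng.randrange(n_seeds) for _ in range(n_seeds)]
        # Level 2: resample problems with replacement.
        prob_idx = [rng.randrange(n_problems) for _ in range(n_problems)]
        total = sum(diffs[s][p] for s in seed_idx for p in prob_idx)
        stats.append(total / (n_seeds * n_problems))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data: 3 seeds x 5 problems of paired pass@4 differences in {-1, 0, 1}.
toy_diffs = [
    [1, 0, 0, 1, 0],
    [1, 0, -1, 1, 0],
    [0, 0, 0, 1, 1],
]
lo, hi = hierarchical_bootstrap_ci(toy_diffs, n_boot=500)
print(lo, hi, "excludes zero:", lo > 0)
```

A cell counts as significant in this scheme when the interval excludes zero, which is how the "six of nine" tally reads.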

major comments (3)

[Abstract; Table 1; §1] Abstract vs. body inconsistency on the flagship result: the abstract claims Qwen3.5-9B LiveCodeBench-v6 improves from PKPO 0.588 to 0.748 (+0.16; “paired bootstrap”), while §1, Table 1, Appendix A, and Table 9 report 0.728 (+0.14) under hierarchical bootstrap. This is the paper’s own strongest advertised number and must be reconciled (and the bootstrap procedure named consistently) before acceptance; the directional claim is unaffected, but the abstract currently overstates the point estimate.
[Figure 2; Appendix P; Table 22] For K_solve > 4, Figure 2 and Table 22 pool independent K_tuple=4 rollouts rather than training or sampling a single larger coordinated tuple. The manuscript states this correctly in Appendix P, but the main-text pass@K curves and the claim that “gains persist at larger attempt budgets” can still be read as if coordination scales with K. Please state the pooling protocol in the main text near Figure 2 and avoid language that implies a single K-way joint plan for K>4.
[Table 1; Appendix B; §4.3] Statistical support rests on three training seeds. The hierarchical bootstrap and per-seed CIs in Appendix B are appropriately conservative, and six cells exclude zero, but three unmarked cells (2B APPS, 4B LCBv6, 9B APPS) have intervals that include zero despite positive means. The abstract’s “six of nine” phrasing is accurate; ensure the main claim does not over-generalize to uniform significance, and consider whether additional seeds or a pre-registered primary cell would strengthen the headline transfer result.
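The pooling protocol raised in the second major comment can be made concrete with a short sketch: for K_solve > 4, attempts from several independent 4-tuples are pooled, and a problem counts as solved if any pooled attempt passes. The function names are hypothetical, and the sketch covers only multiples of K_tuple; how the paper evaluates other values of K is not stated in this review.

```python
from typing import Sequence


def pooled_pass_at_k(
    tuple_outcomes: Sequence[Sequence[int]],  # one 0/1 list per independent tuple
    k_solve: int,
    k_tuple: int = 4,
) -> int:
    """Return 1 if any attempt in the first k_solve // k_tuple tuples passed."""
    if k_solve % k_tuple != 0:
        raise ValueError("this sketch only handles multiples of k_tuple")
    n_tuples = k_solve // k_tuple
    if n_tuples > len(tuple_outcomes):
        raise ValueError("not enough independent tuples for this budget")
    pooled = [r for outcomes in tuple_outcomes[:n_tuples] for r in outcomes]
    return int(any(pooled))


def benchmark_pass_at_k(
    per_problem: Sequence[Sequence[Sequence[int]]],
    k_solve: int,
    k_tuple: int = 4,
) -> float:
    """Fraction of problems solved under the pooled budget."""
    solved = sum(pooled_pass_at_k(t, k_solve, k_tuple) for t in per_problem)
    return solved / len(per_problem)


# Toy outcomes for three problems, two independent 4-tuples each.
toy = [
    [[0, 0, 0, 0], [0, 1, 0, 0]],  # solved only by the second tuple
    [[1, 0, 0, 0], [0, 0, 0, 0]],  # solved by the first tuple
    [[0, 0, 0, 0], [0, 0, 0, 0]],  # never solved
]
print(benchmark_pass_at_k(toy, k_solve=4))
print(benchmark_pass_at_k(toy, k_solve=8))
```

Nothing in the second tuple is conditioned on the first, so coordination in this protocol stays within each group of four attempts.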

minor comments (5)

[Abstract; §4.3] Abstract and early prose sometimes say “paired bootstrap” while §4.3 defines hierarchical problem-and-seed bootstrap; align terminology throughout.
[Table 4; Appendix M] Table 4 / Appendix M: planner–solver token split is missing for some LCBv6 rows (dashed cells). Either recover the split or note why only aggregate decoded length is available so token-normalized comparisons remain interpretable.
[Limitations] Limitations correctly flag single-path math; a short qualitative failure case (or a small MATH/AIME pilot) would make the scope boundary more concrete for readers outside competitive programming.
[Ethics Statement / Code and artifact release] Code, RM checkpoints, and judge prompts are promised only upon acceptance. For reproducibility review, a minimal public artifact (sandbox config, evaluation harness, seed logs) would help even before full release.
[Appendix C; table captions] Minor polish: ensure consistent reporting of ±std vs. bootstrap CIs in captions; check that Example strategy tuples in Appendix C are clearly labeled as post-hoc illustration only (already mostly done).

Circularity Check

0 steps flagged

No significant circularity: pass@K gains are measured by external sandboxed verification on decontaminated held-out benchmarks, not forced by the training objective or self-citation.


CPPO is an empirical RLVR method, not a first-principles derivation. The load-bearing claim is that a joint planner–solver policy trained with R_plan = J_ψ · R_out improves pass@4 under a fixed K=4 solver-attempt budget. That claim is tested against independent baselines on APPS, CodeContests valid, and LiveCodeBench-v6 with decontamination, hierarchical bootstrap over problems and seeds, and ablations that remove the validity gate, outcome credit, across-tuple normalization, and joint (vs iid) inference. J_ψ is a frozen validity filter trained on LLM-judge labels for parseability/non-duplication/no-leakage; the paper itself reports that it matches those labels (AUC 0.971) but only weakly predicts solver success (AUC 0.572), so planner credit is not the evaluation target by construction—R_out and final pass@K come from sandboxed execution of official tests. There is no fitted parameter renamed as a prediction, no uniqueness theorem imported from the authors, and no self-citation chain that forces the result. The abstract/body mismatch on the headline LCBv6 number (0.748 vs 0.728) is a presentation inconsistency, not circular reduction of the claim to its inputs. Score 0 is the honest finding.

Axiom & Free-Parameter Ledger

4 free parameters · 4 axioms · 3 invented entities

The central claim rests on standard RLVR/GRPO machinery, the domain premise that competitive-programming tasks have multiple usable strategies, and several hand-chosen training knobs (K=4, τ, M, staged warm-up). The main invented objects are the coordinated strategy-tuple policy and the multiplicative validity-gated planner reward; both are operational definitions with experimental handles, not free-floating physical entities.

free parameters (4)

planner tuple size K_tuple = 4
Fixed at 4 for all main training; larger budgets pool multiple 4-tuples rather than retrain. Choice is design, not derived.
validity-gate threshold τ = 0.17
Chosen on validation for high-recall gating of J_ψ; main experiments use τ=0.17.
across-tuple sample count M = 8
Number of planner tuples per prompt for planner advantage normalization; main setting M=8 after a small sweep.
AdamW learning rate and GRPO clip/KL: lr = 5e-7, ε = 0.2, β_KL = 0.01
Optimizer hyperparameters reused across CPPO and trained baselines (e.g. lr 5e-7, ε=0.2, KL 0.01) and fixed on a CodeContests-train dev subset.

axioms (4)

domain assumption: Binary execution verifier V(x,y)∈{0,1} is a sufficient reward for correctness in competitive programming.
RLVR setup in §2–§3; sandbox scoring defines R_out and pass@K.
domain assumption: Many competitive-programming problems admit multiple distinct algorithmic strategies such that covering strategies improves pass@K more than resampling one mode.
Stated in Introduction and Limitations; scopes the method away from single-path math.
standard math: GRPO with group-normalized advantages and clipped KL-regularized updates is a valid policy optimizer for the split planner/solver token regions.
Optimizer held fixed from Shao et al. / Schulman et al.; Appendix J.
ad hoc to paper: A small generative plan-validity model trained on LLM-as-judge labels can gate malformed/duplicate/leaky plans without needing to score task quality.
Stage 2 and §4.4; diagnostics show high judge agreement but weak solver-outcome prediction.

invented entities (3)

Coordinated strategy-tuple policy π_Θ(τ|x) = q_Θ(S|x) ∏_i p_Θ(y_i|x, s_i) (independent evidence)
purpose: Replace K iid answer draws with one joint plan of K methods plus one solve per method.
Core policy factorization in §3.1; evaluated against iid and planning baselines.
Multiplicative planner reward R_plan = J_ψ · R_out (independent evidence)
purpose: Assign planner credit only to valid tuples that yield at least one verifier pass.
Eq. (10); ablations remove J_ψ or R_out to isolate roles.
Plan-validity gate J_ψ (no independent evidence)
purpose: Binary filter for parseable, non-duplicate, non-leaking, on-topic strategy tuples.
Trained Stage 2; frozen during phases; not claimed as a general plan-quality oracle.

pith-pipeline@v1.1.0-grok45 · 34858 in / 3425 out tokens · 40355 ms · 2026-07-12T15:55:22.268501+00:00 · methodology


Original abstract

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, $R_{\mathrm{plan}} = J_\psi \cdot R_{\mathrm{out}}$, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@$K$ success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@$4$ over direct sampling, planning baselines, planner-only SFT, and pass@$K$-oriented RL under the same $K{=}4$ solver-attempt budget, with statistically significant gains on six of nine model--benchmark cells. The largest single gain is $+0.16$ on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ($0.588 \rightarrow 0.748$; paired bootstrap, $p < 0.05$).

Figures

Figures reproduced from arXiv: 2605.27000 by Suman Banerjee, Tong Che, Yilong Li.

**Figure 1.** Overview of Coordinated Pass@K Policy Optimization. The planner q_Θ emits a strategy tuple S = (s_1, …, s_K); the shared solver p_Θ produces one solution per strategy; a verifier returns per-branch outcomes r_i ∈ {0, 1}, and the outcome reward R_out = max_i r_i scores pass@K success. A frozen reward model J_ψ(x, S) gates plan validity, giving the planner reward R_plan = J_ψ(x, S) · R_out, which is nonzero only wh…

**Figure 2.** LiveCodeBench-v6 pass@K for Qwen3.5-9B (4B counterpart in …

**Figure 3.** Reward-model diagnostics. (a) The RM matches held-out judge labels but weakly predicts frozen-solver outcomes, supporting its use as a validity gate rather than a plan-quality scorer. (b) Joint distribution of validity decisions and solver outcomes during CPPO rollouts; shading encodes rollout frequency, and only the accepted, solved cell yields nonzero R_plan.

**Figure 4.** Relationship between algorithmic diversity …

**Figure 5.** LiveCodeBench-v6 pass@K across baselines for Qwen3.5-4B and Qwen3.5-9B.

**Figure 6.** APPS pass@K across Qwen3.5-2B, Qwen3.5-4B, and Qwen3.5-9B. Each method occupies a single color, with darker shades for larger models. Axes: K ∈ {1, 2, 4, 8, 16} against pass@K (APPS). Legend: Direct Solve, Plan-and-Solve, PlanSearch, PKPO, and CPPO, each at 4B and 9B.

**Figure 7.** Grouped APPS pass@K for Qwen3.5-4B and Qwen3.5-9B at K ∈ {1, 2, 4, 8, 16}. Each method occupies a single color across both sizes, with a lighter shade for 4B and a darker shade for 9B.

**Figure 8.** PKPO training dynamics for Qwen3.5-2B, 4B, and 9B over 30 epochs (light to dark red): gradient …

**Figure 9.** UpSkill training dynamics for Qwen3.5-2B, 4B, and 9B over 30 epochs (light to dark purple): gradient L2 …


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rank-Conditioned Sample Reuse for the Plackett--Luce Best-of-$K$ Objective
cs.LG 2026-07 accept novelty 6.5

Rank-conditioned Horvitz–Thompson reuses all C(n,K) subsets of one Gumbel-Top-n pool for unbiased Plackett–Luce best-of-K value and score-function gradient, with an exact Max-specific DP collapse to a 1-D integral.

Reference graph

Works this paper leans on

10 extracted references · 2 linked inside Pith · cited by 1 Pith paper

[1]

Google DeepMind

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Google DeepMind. 2026. Gemma 4: Open lightweight models. Official model cards, https://ai.google.dev/gemma. Gemma 4 E2B and E4B. LiveCodeBench-v6 pass@4 values used in this paper (44.0% for E2B, 52.0% for E4B) reproduced from the official model-card evaluation tables. Acces...

Pith/arXiv arXiv 2026
[2]

train + val

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645:633–638. Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. In Advances in Neural Informatio...

Pith/arXiv arXiv 2021
[3]

The same base-model solver is given a prompt that asks it to first write a solution plan and then write the code
[4]

For each problem we sample K full solution attempts under that prompt
[5]

Each attempt passes through the same code extraction, sandboxed execution, and verifier pipeline
[6]

This comparison isolates the effect of adding a planning prompt before code generation

Pass@K is the fraction of problems on which at least one of the K attempts passes the tests. This comparison isolates the effect of adding a planning prompt before code generation. PlanSearch (Wang et al., 2024) as a planning baseline. We follow the released implementation, using eight candidate plans per problem and the same solver/verifier pipeline as the ...

2024
[7]

The base model first generates multiple candidate plans per problem; we use 8 candidates
[8]

The candidates are selected and organized into plans usable for solving
[9]

The same frozen base-model solver then generates code conditioned on each selected plan
[10]

This comparison isolates inference-time multi-plan search without planner training

Each problem is evaluated under a budget of K solver attempts, with the same verifier and pass@K definition as above. This comparison isolates inference-time multi-plan search without planner training. PKPO (Walder and Karkhanis, 2025) as a pass@K RL baseline. PKPO transforms the per-sample reward vector r ∈ {0,1}^n over n ≥ k samples of the same problem into...

2025
