CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

Chenhui Liu; Jiemin Wu; Jindong Li; Menglin Yang; Tian Huang; Yang Yang; Yutao Yue; Zhangyi Hu; Zining Zhong

arXiv: 2605.23491 · v2 · pith:C7A5BAATnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI · cs.CL

Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue

Pith reviewed 2026-05-25 05:11 UTC · model grok-4.3

classification: 💻 cs.LG, cs.AI, cs.CL

keywords: test-time scaling, code generation, self-play, unit tests, LLM, cooperative refinement, ground-truth free

The pith

CoSPlay lets self-generated codes and unit tests iteratively refine each other at test time to improve LLM code generation without ground-truth data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoSPlay as a training-free method that addresses the reliance on ground-truth unit tests in both reinforcement learning and test-time scaling for code generation. It works by generating pools of candidate codes and unit tests, then using execution results between them to prune weak codes and replace unreliable tests in repeated rounds. This mutual improvement lets the two sets co-evolve until a final selection step picks the code from the largest group that agrees on outputs. A reader would care because the approach removes the need for expensive labeled test data while still reaching performance levels previously achieved only through supervised training.

Core claim

CoSPlay is a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge.

What carries the argument

The Code-UT execution matrix whose bidirectional pass-count signals drive iterative pruning of weak codes and replacement of unreliable unit tests, followed by output-consensus cluster selection on ties.
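To make the two signals concrete, here is a minimal sketch of a Code-UT execution matrix and the pass counts read off it in both directions. The `Code` and `UnitTest` types, the function names, and the toy "square an integer" task are all assumptions for illustration; the summary above does not give the paper's actual data structures or thresholds.

```python
from typing import Callable, List, Sequence, Tuple

Code = Callable[[int], int]   # hypothetical: a candidate solution
UnitTest = Tuple[int, int]    # hypothetical: (input, expected output)


def execution_matrix(codes: Sequence[Code], tests: Sequence[UnitTest]) -> List[List[bool]]:
    """matrix[i][j] is True when code i returns the expected output of test j."""
    matrix = []
    for code in codes:
        row = []
        for x, expected in tests:
            try:
                row.append(code(x) == expected)
            except Exception:
                row.append(False)  # assumption: a crash counts as a failed test
        matrix.append(row)
    return matrix


def pass_counts(matrix: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Row sums score the codes; column sums score the unit tests."""
    code_counts = [sum(row) for row in matrix]
    ut_counts = [sum(col) for col in zip(*matrix)]
    return code_counts, ut_counts


# Toy task: square an integer. The last code and the last test are both wrong.
codes = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
tests = [(2, 4), (3, 9), (4, 16), (3, 6)]
code_counts, ut_counts = pass_counts(execution_matrix(codes, tests))
print(code_counts)  # [3, 3, 2]
print(ut_counts)    # [3, 2, 2, 1]
```

In this toy case the wrong code has the lowest row sum and the wrong test has the lowest column sum, which is the pattern the method relies on. Note that the test `(2, 4)` is also passed by the wrong code `x + x`, a small instance of the spurious agreement the referee report worries about.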

If this is right

On Qwen2.5-7B-Instruct, average BoN rises from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%.
The same procedure applied to the RLVR model CURE-7B adds another 5.7% BoN.
The method generalizes across multiple model backbones and beats other GT-free test-time scaling baselines at equal token budgets.
Performance continues to rise as the inference token budget increases.
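As a quick arithmetic check on the first bullet, the sketch below separates absolute gains from relative gains. It assumes the reported figures are percentages, so the absolute gains are percentage points; the helper name `gains` is hypothetical.

```python
def gains(before: float, after: float):
    """Return (absolute gain in points, relative gain as a fraction of `before`)."""
    absolute = after - before
    return absolute, absolute / before


bon_abs, bon_rel = gains(22.1, 33.2)   # average BoN on Qwen2.5-7B-Instruct
ut_abs, ut_rel = gains(14.6, 78.3)     # UT accuracy on the same model

print(f"BoN: +{bon_abs:.1f} points ({bon_rel:.0%} relative)")
print(f"UT accuracy: +{ut_abs:.1f} points ({ut_rel:.0%} relative)")
```

Under that reading, BoN gains about 11.1 points (roughly half again over the baseline) and UT accuracy gains about 63.7 points, so the UT pool improves far more than the code pool in relative terms.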

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the co-evolution continues to improve with larger budgets, inference-time methods could reduce reliance on expensive RLVR training runs for code models.
The consensus-cluster tie-breaker could extend to other verification settings where multiple candidates can be checked against one another rather than against external labels.
The approach may apply to domains beyond code where executable verification is possible but ground-truth oracles are scarce.

Load-bearing premise

Bidirectional pass-count signals from the Code-UT execution matrix can reliably identify and prune weak codes while refreshing unreliable UTs, and correct codes will form the largest output-consensus cluster when pass counts tie.
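The tie-breaking half of this premise can be sketched as follows: among codes tied at the highest pass count, group them by their outputs on a shared set of inputs and pick from the largest group. The function names, the use of `repr` for exact output matching, and the treatment of runtime errors as an output value are assumptions of this sketch, not details confirmed by the summary.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def run_safely(code: Callable[[int], int], x: int) -> str:
    """Return a comparable string for the output; errors become a marker (assumption)."""
    try:
        return repr(code(x))
    except Exception as exc:
        return f"<error:{type(exc).__name__}>"


def select_by_consensus(codes: Sequence[Callable[[int], int]],
                        code_counts: Sequence[int],
                        inputs: Sequence[int]) -> Tuple[int, List[int]]:
    """Among top-pass-count codes, return (chosen index, members of the largest cluster)."""
    best = max(code_counts)
    tied = [i for i, count in enumerate(code_counts) if count == best]
    clusters: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for i in tied:
        signature = tuple(run_safely(codes[i], x) for x in inputs)
        clusters[signature].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0], largest


# Toy task: absolute value. Tests (0, 0) and (3, 3) cannot separate the first three codes.
codes = [abs, lambda x: x if x >= 0 else -x, lambda x: x, lambda x: -x]
code_counts = [2, 2, 2, 1]
chosen, cluster = select_by_consensus(codes, code_counts, inputs=[-2, 0, 3])
print(chosen, cluster)  # 0 [0, 1]
```

Here the two correct implementations agree on every input and form the largest cluster, while the identity function sits alone. The rule gives no such protection if several wrong codes happen to share the same wrong outputs.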

What would settle it

Running the method on a benchmark where incorrect codes that share the same wrong outputs form the largest consensus cluster, or where pass counts show no correlation with actual correctness, would show whether performance gains disappear.
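A minimal sketch of how such a check could be scripted offline, assuming ground-truth correctness labels are available for analysis only. The function names and the toy inputs are hypothetical and are not results from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def largest_cluster_is_wrong(signatures: Sequence[Hashable],
                             is_correct: Sequence[bool]) -> bool:
    """True when no member of the largest output-consensus cluster is correct."""
    top_signature, _ = Counter(signatures).most_common(1)[0]
    members = [i for i, sig in enumerate(signatures) if sig == top_signature]
    return not any(is_correct[i] for i in members)


def accuracy_by_pass_count(pass_counts: Sequence[int],
                           is_correct: Sequence[bool]) -> Dict[int, float]:
    """Fraction of correct candidates at each pass count; flat values mean no signal."""
    buckets = defaultdict(list)
    for count, ok in zip(pass_counts, is_correct):
        buckets[count].append(ok)
    return {count: sum(oks) / len(oks) for count, oks in sorted(buckets.items())}


# Illustrative failure case: three wrong codes share output signature "A".
print(largest_cluster_is_wrong(["A", "A", "A", "B", "C"],
                               [False, False, False, True, False]))  # True
print(accuracy_by_pass_count([3, 3, 2, 2, 1],
                             [True, False, True, False, False]))
```

A benchmark on which the first check is frequently `True`, or on which accuracy does not rise with pass count, would be the kind of evidence described above.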

Figures

Figures reproduced from arXiv: 2605.23491 by Chenhui Liu, Jiemin Wu, Jindong Li, Menglin Yang, Tian Huang, Yang Yang, Yutao Yue, Zhangyi Hu, Zining Zhong.

**Figure 1.** Figure 1: Performance comparison between our Training-free and GT-free CoSPlay and other RLVR methods that need costly weight updating (AZR-7B-Coder 0k) or massive GT data (AceCoder-7B-Rule 22k, AceCoder-7B-RM 329k, CURE-7B 4.5k). [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Our motivation: achieving high accuracy without any Ground-Truth and weight updating. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Method Overview. Given a coding problem, CoSPlay first explores solution-oriented code ideas and derives [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Round-0 pass-count analysis. Panels (a-b) show the density distributions of UT and code pass counts for correct and wrong candidates, while panels (c-d) show GT correctness as a function of pass count. view at source ↗

… some otherwise useful probes before self-play begins. We therefore supplement the pool with random valid inputs sampled directly from the problem statement, which provide broader sanity checks under the same…

**Figure 5.** Figure 5: (a) The code Pass@1 vs other TTS methods. (b) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: (a) shows the generalization of CoSPlay across diverse base and RL models. (b) compares UT pass-count distributions [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Execution-consensus and pass-count analysis. Panels (a-c) show the density distributions of cluster size, UT pass count, and code pass count for correct and wrong candidates, where vertical lines indicate the corresponding mean values. Panels (d-f) show that GT correctness increases with larger cluster sizes and higher pass counts. These results support the use of execution-consensus clusters and execution… view at source ↗

**Figure 8.** Figure 8: Evolution of pass-count distributions during self-play. Both UT and code pass-count distributions progressively shift toward higher-support regions across self-play rounds, suggesting that execution-matrix-driven self-play gradually concentrates support on more reliable UTs and stronger code candidates. Panels: LiveBench, LiveCodeBench, CodeContests, CodeForces. view at source ↗

**Figure 9.** Figure 9: t-SNE visualization of clusters. Across four datasets, correct codes tend to form compact high-density clusters, whereas incorrect codes are more scattered, supporting execution-consensus clustering as an effective GT-free selection signal. view at source ↗

… obtain higher scores for both CoSPlay-7B and CoSPlay-14B, indicating that our scoring rule preserves the largest-cluster intuition while adapting it to runtime-error setti…

**Figure 10.** Figure 10: (a) shows the scalability of CoSPlay with candidate-pool size. (b) shows the trade-off between UT diversity and UT [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Token cost versus Pass@1 of TTS methods and CoSPlay on Qwen2.5-Instruct models. For each baseline method, [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison of the evolution of UT rank over self-play rounds between the ablation w/o random UT initialization [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: UT pass count (number of code candidates passing each UT) distributions at the UT initialization stage, comparing [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗

**Figure 14.** Figure 14: Effect of the number of random valid inputs used for execution-consensus clustering. We vary the number of [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p031_15.png] view at source ↗

**Figure 16.** Figure 16: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p031_16.png] view at source ↗

**Figure 17.** Figure 17: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p032_17.png] view at source ↗

**Figure 18.** Figure 18: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p032_18.png] view at source ↗

**Figure 19.** Figure 19: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p032_19.png] view at source ↗

**Figure 20.** Figure 20: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p033_20.png] view at source ↗

**Figure 21.** Figure 21: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p033_21.png] view at source ↗

**Figure 22.** Figure 22: Density distributions of cluster sizes for correct (blue) and wrong (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p033_22.png] view at source ↗

**Figure 23.** Figure 23: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p034_23.png] view at source ↗

**Figure 24.** Figure 24: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p034_24.png] view at source ↗

**Figure 25.** Figure 25: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p035_25.png] view at source ↗

**Figure 26.** Figure 26: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p035_26.png] view at source ↗

**Figure 27.** Figure 27: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p035_27.png] view at source ↗

**Figure 28.** Figure 28: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p036_28.png] view at source ↗

**Figure 29.** Figure 29: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p036_29.png] view at source ↗

**Figure 30.** Figure 30: Density distributions of UT pass counts for correct (blue) and wrong (red) UT candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p036_30.png] view at source ↗

**Figure 31.** Figure 31: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p037_31.png] view at source ↗

**Figure 32.** Figure 32: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p037_32.png] view at source ↗

**Figure 33.** Figure 33: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p038_33.png] view at source ↗

**Figure 34.** Figure 34: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p038_34.png] view at source ↗

**Figure 35.** Figure 35: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p038_35.png] view at source ↗

**Figure 36.** Figure 36: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p039_36.png] view at source ↗

**Figure 37.** Figure 37: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p039_37.png] view at source ↗

**Figure 38.** Figure 38: Density distributions of code pass counts for correct (blue) and incorrect (red) code candidates during self-play on [PITH_FULL_IMAGE:figures/full_fig_p039_38.png] view at source ↗

**Figure 39.** Figure 39: The relationship between cluster size and average code true accuracy during self-play on CodeContests for both 7B and 14B models. The top row [PITH_FULL_IMAGE:figures/full_fig_p040_39.png] view at source ↗

**Figure 40.** Figure 40: The relationship between cluster size and average code true accuracy during self-play on CodeForces for both 7B and 14B models. The top row [PITH_FULL_IMAGE:figures/full_fig_p040_40.png] view at source ↗

**Figure 41.** Figure 41: The relationship between cluster size and average code true accuracy during self-play on LiveBench for both 7B and 14B models. The top row [PITH_FULL_IMAGE:figures/full_fig_p041_41.png] view at source ↗

**Figure 42.** Figure 42: The relationship between cluster size and average code true accuracy during self-play on LiveCodeBench for both 7B and 14B models. The top row [PITH_FULL_IMAGE:figures/full_fig_p041_42.png] view at source ↗

**Figure 43.** Figure 43: The relationship between UT pass counts on generated codes and average true accuracy for both 7B and 14B models on CodeContests. The top row [PITH_FULL_IMAGE:figures/full_fig_p042_43.png] view at source ↗

**Figure 44.** Figure 44: The relationship between UT pass counts on generated codes and average true accuracy for both 7B and 14B models on CodeForces. The top row [PITH_FULL_IMAGE:figures/full_fig_p042_44.png] view at source ↗

**Figure 45.** Figure 45: The relationship between UT pass counts on generated codes and average true accuracy for both 7B and 14B models on LiveBench. The top row [PITH_FULL_IMAGE:figures/full_fig_p042_45.png] view at source ↗

**Figure 46.** Figure 46: The relationship between UT pass counts on generated codes and average true accuracy for both 7B and 14B models on LiveCodeBench. The top [PITH_FULL_IMAGE:figures/full_fig_p043_46.png] view at source ↗

**Figure 47.** Figure 47: The relationship between code pass counts and average true accuracy for both 7B and 14B models on CodeContests. The top row shows Round 0-2, [PITH_FULL_IMAGE:figures/full_fig_p043_47.png] view at source ↗

**Figure 48.** Figure 48: The relationship between code pass counts and average true accuracy for both 7B and 14B models on CodeForces. The top row shows Round 0-2, [PITH_FULL_IMAGE:figures/full_fig_p044_48.png] view at source ↗

**Figure 49.** Figure 49: The relationship between code pass counts and average true accuracy for both 7B and 14B models on LiveBench. The top row shows Round 0-2, [PITH_FULL_IMAGE:figures/full_fig_p044_49.png] view at source ↗

**Figure 50.** Figure 50: The relationship between code pass counts and average true accuracy for both 7B and 14B models on LiveCodeBench. The top row shows Round [PITH_FULL_IMAGE:figures/full_fig_p044_50.png] view at source ↗

**Figure 51.** Figure 51: Evolution of UT pass-count distributions during self-play with the 7B model. Curves show per-round density changes across four benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p045_51.png] view at source ↗

**Figure 52.** Figure 52: Evolution of UT pass-count distributions during self-play with the 14B model. Curves show per-round density changes across four benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p045_52.png] view at source ↗

**Figure 53.** Figure 53: Evolution of code pass-count distributions during self-play with the 7B model. Curves show per-round density changes across four benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p045_53.png] view at source ↗

**Figure 54.** Figure 54: Evolution of code pass-count distributions during self-play with the 14B model. Curves show per-round density changes across four benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p045_54.png] view at source ↗

**Figure 55.** Figure 55: Evolution of Signal Accuracy across iterative self-play rounds. [PITH_FULL_IMAGE:figures/ful… view at source ↗

**Figure 56.** Figure 56: Evolution of Best-of-N (BoN) accuracy evaluated on four benchmarks during self-play rounds. [PITH_FULL_IMAGE:figures/full_fig_p046_56.png] view at source ↗

**Figure 57.** Figure 57: Evolution of Code Accuracy across iterative self-play rounds. [PITH_FULL_IMAGE:figures/full_fig_p046_57.png] view at source ↗

**Figure 58.** Figure 58: Evolution of Unit Test (UT) Accuracy across iterative self-play rounds. [PITH_FULL_IMAGE:figures/full_fig_p046_58.png] view at source ↗

**Figure 59.** Figure 59: Case study of successful code fix. [PITH_FULL_IMAGE:figures/full_fig_p053_59.png] view at source ↗

**Figure 60.** Figure 60: Execution matrices demonstrating the resolution of Code–UT coupling in CoSPlay. Each row represents a generated code candidate, and each [PITH_FULL_IMAGE:figures/full_fig_p058_60.png] view at source ↗

**Figure 61.** Figure 61: Case study of Code-UT coupling. In the 7B case, the before panel shows a clean Code-UT coupling pattern: the highlighted wrong code passes the highlighted low-pass UT, creating a false positive that can inflate the pass count of a wrong solution. After regeneration, the corresponding orange UT column no longer accepts the same wrong code, thereby removing this spurious agreement. In the 14B case, the hig… view at source ↗

Original abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
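The co-evolution step described in the abstract can be sketched as a loop over the two pools. This is a deliberately simplified sketch under stated assumptions: it replaces only the single lowest-scoring code and unit test per round, and the callbacks `new_code` and `new_test` are hypothetical stand-ins for the LLM calls that fix or regenerate candidates. The paper's actual pruning thresholds and refresh rules are not given in this summary.

```python
from typing import Callable, List, Tuple

Code = Callable[[int], int]   # hypothetical: a candidate solution
UnitTest = Tuple[int, int]    # hypothetical: (input, expected output)


def passes(code: Code, test: UnitTest) -> bool:
    x, expected = test
    try:
        return code(x) == expected
    except Exception:
        return False  # assumption: a crash counts as a failed test


def self_play_round(codes: List[Code], tests: List[UnitTest],
                    new_code: Callable[[], Code],
                    new_test: Callable[[], UnitTest]) -> Tuple[List[Code], List[UnitTest]]:
    """One round: score both pools on the same matrix, then refresh the weakest of each."""
    matrix = [[passes(c, t) for t in tests] for c in codes]
    code_counts = [sum(row) for row in matrix]
    ut_counts = [sum(col) for col in zip(*matrix)]

    codes, tests = list(codes), list(tests)
    worst_code = min(range(len(codes)), key=code_counts.__getitem__)
    if code_counts[worst_code] < max(code_counts):   # skip when all codes tie
        codes[worst_code] = new_code()
    worst_test = min(range(len(tests)), key=ut_counts.__getitem__)
    if ut_counts[worst_test] < max(ut_counts):       # skip when all tests tie
        tests[worst_test] = new_test()
    return codes, tests


def co_evolve(codes, tests, new_code, new_test, rounds: int):
    for _ in range(rounds):
        codes, tests = self_play_round(codes, tests, new_code, new_test)
    return codes, tests


# Toy task: square an integer. The last code and the last test start out wrong.
codes = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
tests = [(2, 4), (3, 9), (4, 16), (3, 6)]
final_codes, final_tests = co_evolve(
    codes, tests,
    new_code=lambda: (lambda x: x * x),   # stand-in for an LLM code fix
    new_test=lambda: (5, 25),             # stand-in for an LLM test refresh
    rounds=3,
)
print(final_tests)                       # [(2, 4), (3, 9), (4, 16), (5, 25)]
print([c(7) for c in final_codes])       # [49, 49, 49]
```

In the toy run the first round replaces both the wrong code and the wrong test, after which every code passes every test and later rounds leave the pools unchanged. A real system would need replacements that can themselves be wrong, which is where the reliability of the pass-count signal matters.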

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoSPlay's co-evolution of code and unit-test pools via execution-matrix pruning and consensus selection is the actual new piece, with solid reported gains but thin support for why the signals stay reliable.

The main thing here is a training-free loop that lets code candidates and self-generated unit tests improve each other at inference time. They start with diverse solutions and failure-mode tests, then use pass-count signals from the Code-UT matrix to prune weak codes and refresh bad tests, with output consensus breaking ties on the final pick. That bidirectional setup and the consensus rule are not in the prior TTS or RLVR abstracts they cite, so the mechanism itself is the novelty.

Referee Report

3 major / 1 minor

Summary. The paper proposes CoSPlay, a ground-truth-free and training-free test-time scaling framework for LLM code generation. It generates diverse code candidates and unit-test ideas, then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune weak codes and refresh unreliable UTs in a cooperative self-play loop. When pass counts tie, final selection uses the largest output-consensus cluster among remaining codes. Experiments on four benchmarks report that CoSPlay raises average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3% on Qwen2.5-7B-Instruct (matching the RLVR baseline CURE-7B), yields an additional 5.7% BoN gain when applied to CURE-7B, and continues to improve with increased token budget across multiple backbones.

Significance. If the empirical gains are robust and the pass-count / consensus mechanism does not suffer from spurious coupling, the work would demonstrate a practical, scalable inference-time alternative to RLVR that eliminates the need for ground-truth unit tests. The reported generalization across backbones and continued scaling with budget would be notable strengths for test-time compute methods in code generation.

major comments (3)

[Abstract, final selection paragraph] The claim that 'correct codes agree on the same inputs while wrong codes diverge', and that the largest consensus cluster is therefore reliable, is presented without counter-example analysis or formal argument. This assumption is load-bearing for the selection step; if multiple incorrect codes agree on the self-generated inputs while correct implementations differ on edge cases, the rule can select an incorrect solution.
[Abstract, iterative co-evolution description] The bidirectional pass-count signals are asserted to 'prune or fix weak codes and refresh or replace unreliable UTs' without any reported ablation on the pruning thresholds, refresh rules, or error analysis of cases where spuriously coupled wrong code-UT pairs reinforce each other. This mechanism is central to the claimed joint improvement and to the GT-free claim.
[Abstract, experimental claims] The reported BoN and UT-accuracy lifts (22.1% → 33.2%, 14.6% → 78.3%) and the 5.7% gain on CURE-7B are stated without reference to ablation tables, variance across runs, or controls for post-hoc selection of the number of iterations or matrix size, making it impossible to assess whether the gains are attributable to the proposed co-evolution or to implementation artifacts.

minor comments (1)

[Abstract] The abstract does not define the precise notion of 'output-consensus cluster' (e.g., whether it is measured by exact output equality or by some normalized distance), which should be clarified for reproducibility.
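The difference between the two readings the referee mentions can be shown with a small sketch. Both key functions below are hypothetical; the summary does not say which notion of equality the paper uses. The normalized variant here ignores whitespace and rounds numeric tokens, which is only one of many possible normalizations.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def exact_key(output: str) -> Hashable:
    """Cluster by byte-for-byte equality of the printed output."""
    return output


def normalized_key(output: str, ndigits: int = 6) -> Hashable:
    """Cluster after splitting on whitespace and rounding numeric tokens (assumption)."""
    tokens = []
    for token in output.split():
        try:
            tokens.append(round(float(token), ndigits))
        except ValueError:
            tokens.append(token)  # non-numeric tokens are compared as text
    return tuple(tokens)


def largest_cluster_size(outputs: Sequence[str], key: Callable[[str], Hashable]) -> int:
    return max(Counter(key(o) for o in outputs).values())


outputs = ["3.0\n", "3", " 3.000000 ", "4"]
print(largest_cluster_size(outputs, exact_key))       # 1
print(largest_cluster_size(outputs, normalized_key))  # 3
```

With exact equality the three equivalent answers land in three separate clusters, so the consensus signal disappears. With normalization they merge into one cluster of size 3. The choice therefore changes which code the tie-breaker selects, which is why the definition matters for reproducibility.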

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive criticism. We address each of the major comments point by point, providing clarifications on our assumptions, mechanisms, and experimental reporting. We believe these responses strengthen the manuscript and indicate where revisions will be made.

Referee: Abstract, final selection paragraph: the claim that 'correct codes agree on the same inputs while wrong codes diverge' and therefore the largest consensus cluster is reliable is presented without counter-example analysis or formal argument. This assumption is load-bearing for the selection step; if multiple incorrect codes agree on the self-generated inputs while correct implementations differ on edge cases, the rule can select an incorrect solution.

Authors: We acknowledge that the assumption underlying the consensus-based selection is heuristic rather than formally proven. The rationale is that for a given set of inputs (from the UTs), correct code implementations must produce identical outputs by definition, while incorrect implementations tend to produce divergent outputs, particularly on the discriminative tests generated by our method. This is supported by the observed performance improvements and the clustering results in our experiments. However, we agree that discussing potential counterexamples would be beneficial. In the revised manuscript, we will add a paragraph in the method section discussing this assumption, including analysis of cases where it might fail and how the co-evolution mitigates such risks. Revision: partial.
Referee: Abstract (iterative co-evolution description): the bidirectional pass-count signals are asserted to 'prune or fix weak codes and refresh or replace unreliable UTs' without any reported ablation on the pruning thresholds, refresh rules, or error analysis of cases where spuriously coupled wrong code-UT pairs reinforce each other. This mechanism is central to the claimed joint improvement and to the GT-free claim.

Authors: The full manuscript provides ablations on the number of iterations, the size of the code and UT pools, and the impact of the pass-count thresholds in Section 4. We also analyze the evolution of pass rates over iterations to show that the process improves both pools without collapse. Regarding spurious coupling, the bidirectional update is intended to detect and refresh unreliable UTs by cross-referencing with multiple codes. We will expand the experimental section with a dedicated subsection on failure modes and error analysis of potential spurious reinforcements, including examples from the benchmarks. Revision: partial.
Referee: Abstract (experimental claims): the reported BoN and UT-accuracy lifts (22.1% → 33.2%, 14.6% → 78.3%) and the 5.7% gain on CURE-7B are stated without reference to ablation tables, variance across runs, or controls for post-hoc selection of the number of iterations or matrix size, making it impossible to assess whether the gains are attributable to the proposed co-evolution or to implementation artifacts.

Authors: The abstract summarizes the main results, while the full paper includes detailed ablation studies on token budget, iteration count, and matrix dimensions (Tables 3-5), showing consistent gains across configurations. Results are averaged over 3 random seeds where stochasticity is present, with standard deviations reported in the appendix. The iteration count is determined by a fixed schedule based on the token budget rather than post-hoc selection. We will update the abstract to include a brief reference to these supporting analyses and ensure all variance metrics are clearly linked in the main text. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; method is heuristic-based and empirically validated

The paper describes a training-free, GT-free iterative procedure that generates codes and UTs, executes them to obtain pass-count signals from the Code-UT matrix, prunes based on those counts, and selects via output-consensus clustering. No equations, parameters, or derivations are presented that reduce the final output to the inputs by construction. The central selection rule (largest consensus cluster when pass counts tie) is justified by an external assumption about correct vs. incorrect behavior rather than being defined in terms of itself. No self-citations are invoked as load-bearing uniqueness theorems, and performance claims are measured on external benchmarks rather than fitted quantities. The approach is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only view yields limited ledger entries. The central claim rests on the unstated premise that execution signals between self-generated items are sufficiently informative to drive improvement without external verification.

axioms (2)

Domain assumption: Correct codes agree on outputs for the same inputs while incorrect codes diverge. Invoked in the final selection step when pass counts tie (abstract).
Domain assumption: Bidirectional pass-count signals suffice to distinguish and repair weak codes and unreliable UTs. This is the core iterative mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5915 in / 1440 out tokens · 18705 ms · 2026-05-25T05:11:14.334963+00:00 · methodology
