Recognition: no theorem link
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3
The pith
LLMs can improve code generation without ground-truth supervision by co-evolving a coder and tester through their self-generated execution feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZeroCoder is a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions. For each problem, it executes sampled solutions against sampled tests to form a passing matrix, identifies a consensus subset of likely-correct solutions and consistent tests via a pluggable selection algorithm, and derives role-specific rewards. Low-information instances are filtered out via rank-based pre-filtering, the Tester is trained with a curriculum balancing validity and mutation-driven discriminativeness, and DyB4 dynamically recalibrates selection priors to counter selector drift.
What carries the argument
The passing matrix formed by executing self-generated code solutions against self-generated tests, from which a selection algorithm extracts a consensus subset to supply role-specific rewards for joint training.
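The passing-matrix-plus-selection step can be made concrete with a minimal sketch. The paper does not publish ZeroCoder's exact selection rule, so the majority-threshold selector below is a deliberately simple stand-in for the pluggable algorithm, and `run` is a hypothetical executor; only the matrix shape (rows = solutions, columns = tests) is taken from the review itself.

```python
# Hedged sketch: a passing matrix over self-generated solutions and tests,
# with a simple majority-consensus selector standing in for the paper's
# pluggable selection algorithm. Names here are illustrative.

def build_passing_matrix(solutions, tests, run):
    """M[i][j] = 1 if solution i passes test j, else 0.

    `run(solution, test)` is a hypothetical executor returning True/False.
    """
    return [[1 if run(s, t) else 0 for t in tests] for s in solutions]

def consensus_select(matrix, threshold=0.5):
    """Select solutions that pass most tests, and tests most solutions pass.

    Returns (solution_indices, test_indices); a stand-in, not the paper's rule.
    """
    n_sols, n_tests = len(matrix), len(matrix[0])
    sol_rate = [sum(row) / n_tests for row in matrix]
    test_rate = [sum(matrix[i][j] for i in range(n_sols)) / n_sols
                 for j in range(n_tests)]
    sols = [i for i, r in enumerate(sol_rate) if r > threshold]
    tsts = [j for j, r in enumerate(test_rate) if r > threshold]
    return sols, tsts

# Toy demo with a precomputed 3x4 matrix (no real execution needed):
M = [[1, 1, 1, 0],   # solution 0 passes 3/4 tests
     [1, 1, 0, 0],   # solution 1 passes 2/4
     [0, 0, 0, 1]]   # solution 2 passes 1/4
sols, tsts = consensus_select(M)
print(sols, tsts)  # the consensus subset feeding role-specific rewards
```

In ZeroCoder the selected subset then supplies rewards to both roles; the correlated-bug risk discussed in the referee report arises exactly when wrong solutions agree often enough to clear a threshold like this one.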
If this is right
- Code generation improves by up to 14.5 percent over the base model in the fully label-free setting across three models and six benchmarks.
- With the DyB4 selector, code generation gains reach 21.6 percent and test generation improves by 24.3 percent, approaching oracle-supervised performance.
- Both the code generator and the test generator improve through the same co-evolutionary loop.
- The framework works with pluggable selection algorithms and includes rank-based pre-filtering to maintain reward quality.
Where Pith is reading between the lines
- Methods like this could cut the cost of building capable coding models by removing the need for large human-curated test suites.
- The same co-evolution pattern could be tested on other domains where one model can verify another through execution, such as simple program verification tasks.
- Selector drift points to a general need for adaptive calibration in any self-supervised reward system that runs for many iterations.
Load-bearing premise
The consensus subset chosen from the passing matrix of self-generated code-test pairs continues to identify correct solutions and good tests even as the models improve together and the selection rules shift.
What would settle it
If the pass rate of the trained coder on a benchmark that has hidden ground-truth solutions shows no gain or a loss relative to the starting base model, the claim of effective label-free improvement would be falsified.
Figures
Original abstract
Code generation is important in software engineering, and Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm to improve it through execution-based feedback. However, most RLVR pipelines rely on human-curated tests, making progress bottlenecked by scarce and costly supervision. Existing work tried to use self-generated tests to ground rewards, but the lack of discriminative tests constrains the effect due to the sub-optimal performance of the model on test generation. We aim to improve code generation without ground-truth supervision by co-evolving code and test generation, so that their interactions yield progressively more informative supervision. To this end, we present ZeroCoder, a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions. For each problem, ZeroCoder executes sampled solutions against sampled tests to form a passing matrix, identifies a consensus subset of likely-correct solutions and consistent tests via a pluggable selection algorithm, and derives role-specific rewards. To ensure reward quality, ZeroCoder filters low-information instances via rank-based pre-filtering and trains the Tester with a curriculum balancing validity and mutation-driven discriminativeness. We further identify selector drift, the progressive miscalibration of fixed selection rules during co-evolution, and introduce DyB4, a Bayesian selector that uses as few as 10 labeled instances to recalibrate its priors dynamically. Across three models and six benchmarks, ZeroCoder consistently improves code generation and test generation. In the fully label-free setting, it improves code generation by up to 14.5% over the base model on Qwen2.5-Coder-7B-Instruct. With DyB4, the gain reaches 21.6%, while test generation improves by 24.3%, approaching oracle-supervised performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ZeroCoder, a fully label-free co-evolutionary framework for jointly training coder and tester LLMs. For each problem it samples solutions and tests, executes them to build a passing matrix, applies a pluggable selector to extract a consensus subset of likely-correct codes and consistent tests, derives role-specific rewards, applies rank-based pre-filtering, and trains the tester with a validity/mutation curriculum. It identifies selector drift and proposes DyB4 (a Bayesian selector recalibrated with 10 labeled instances) to mitigate it. Experiments across three models and six benchmarks report code-generation gains of up to 14.5% (label-free) and 21.6% (with DyB4) plus 24.3% test-generation improvement, approaching oracle-supervised performance.
Significance. If the selector reliably isolates correct solutions, the work would meaningfully reduce dependence on human-curated tests in RLVR pipelines for code generation. The explicit treatment of selector drift and the multi-model/multi-benchmark evaluation are strengths; the framework is pluggable and the gains are quantified on standard models (e.g., Qwen2.5-Coder-7B-Instruct).
major comments (2)
- §3.2 (Consensus Selection): The central claim that the pluggable selector extracts a subset whose members are actually correct on hidden ground-truth tests is load-bearing yet unverified. No direct measurement (e.g., precision of the selected subset against oracle tests on a held-out set) is reported, leaving open the risk that correlated latent bugs produce spurious consensus and are then reinforced by the derived rewards.
- §4.2 (DyB4 and Label-Free Setting): DyB4 requires 10 labeled instances for recalibration, so the higher gains (21.6%) are not fully label-free. The manuscript should provide an explicit ablation separating the zero-label regime from the 10-label regime and clarify how this affects the abstract's claim of 'fully label-free' improvement.
minor comments (2)
- [§3.1] The passing matrix is introduced without a formal definition or equation (rows = solutions, columns = tests, entry = pass/fail); adding this early would improve readability.
- [Results section] Table captions and result paragraphs should state the number of runs, random seeds, and whether statistical significance tests were performed on the reported percentage gains.
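The formal definition requested in the first minor comment could be sketched as follows; the symbols ($m$, $n$, $s_i$, $t_j$) are illustrative choices, not the paper's own notation.

```latex
% Passing matrix for one problem: m sampled solutions, n sampled tests.
M \in \{0,1\}^{m \times n}, \qquad
M_{ij} =
\begin{cases}
  1 & \text{if solution } s_i \text{ passes test } t_j,\\
  0 & \text{otherwise.}
\end{cases}
```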
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the selector verification and the distinction between label-free and minimally-supervised regimes. We address each major comment below with specific plans for revision.
Point-by-point responses
- Referee: §3.2 (Consensus Selection): The central claim that the pluggable selector extracts a subset whose members are actually correct on hidden ground-truth tests is load-bearing yet unverified. No direct measurement (e.g., precision of the selected subset against oracle tests on a held-out set) is reported, leaving open the risk that correlated latent bugs produce spurious consensus and are then reinforced by the derived rewards.
Authors: We agree that direct verification of the selector's precision against oracle tests is important to substantiate the central claim and rule out spurious consensus from correlated bugs. While the reported code-generation gains on benchmarks (where ground-truth tests exist for final evaluation) provide indirect evidence of selector quality, an explicit measurement was not included. In the revised manuscript we will add an analysis reporting precision, recall, and F1-score of the consensus-selected code subset against the hidden ground-truth tests on held-out problems. This will quantify selector reliability and directly address the concern. revision: yes
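The analysis the authors promise here is straightforward to specify. A minimal sketch, assuming index sets over the sampled solutions (the function name and inputs are illustrative, not from the paper):

```python
# Hedged sketch of the proposed selector audit: score the consensus-selected
# subset against hidden oracle labels on held-out problems.

def selector_prf(selected, oracle_correct):
    """Precision/recall/F1 of the selected subset w.r.t. oracle correctness.

    `selected`: indices the consensus selector picked.
    `oracle_correct`: indices the hidden ground-truth tests mark as correct.
    """
    selected, oracle_correct = set(selected), set(oracle_correct)
    tp = len(selected & oracle_correct)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(oracle_correct) if oracle_correct else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: selector picked {0, 1, 3}; oracle says {0, 3, 4} are correct.
p, r, f = selector_prf([0, 1, 3], [0, 3, 4])
print(p, r, f)  # precision and recall are each 2/3 here
```

High precision on such an audit would directly address the correlated-bug concern; low precision with high benchmark gains would instead suggest the rewards help for reasons other than selecting correct code.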
- Referee: §4.2 (DyB4 and Label-Free Setting): DyB4 requires 10 labeled instances for recalibration, so the higher gains (21.6%) are not fully label-free. The manuscript should provide an explicit ablation separating the zero-label regime from the 10-label regime and clarify how this affects the abstract's claim of 'fully label-free' improvement.
Authors: The abstract and experimental sections already distinguish the two regimes (14.5% fully label-free vs. 21.6% with DyB4). To improve clarity, we will revise the abstract and introduction to state explicitly that the core ZeroCoder framework is label-free and that DyB4 is an optional Bayesian recalibration module using only 10 labeled instances to counter selector drift. We will also add a dedicated ablation comparing zero-label and 10-label performance across all models and benchmarks, including effects on test-generation quality. This will ensure the 'fully label-free' claim is unambiguously tied to the base framework. revision: yes
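The recalibration step under discussion is not specified in detail by the abstract; one plausible shape of "recalibrating priors from ~10 labeled instances" is a conjugate Beta-Bernoulli update on the probability that a consensus pick is correct. Everything below is an illustrative sketch under that assumption, not DyB4's actual mechanism.

```python
# Hedged sketch: Beta-Bernoulli recalibration from a handful of labels.
# All names are illustrative; the paper's DyB4 internals are not published.

def recalibrate(prior_a, prior_b, labeled_outcomes):
    """Update a Beta(a, b) prior with labeled selector outcomes.

    `labeled_outcomes`: booleans recording whether the selector's pick was
    correct on each labeled instance. Returns the posterior (a, b).
    """
    successes = sum(labeled_outcomes)
    failures = len(labeled_outcomes) - successes
    return prior_a + successes, prior_b + failures

def posterior_mean(a, b):
    """Posterior estimate of the selector's precision."""
    return a / (a + b)

# Start from a mildly optimistic prior, then observe 10 labeled instances
# where the fixed rule picked correctly only 6 times (selector drift):
a, b = recalibrate(8.0, 2.0, [True] * 6 + [False] * 4)
print(posterior_mean(a, b))  # 14/20 = 0.7, pulled down from the prior's 0.8
```

This also makes the referee's point concrete: the 10 labels enter the loop, so results using them belong to a minimally-supervised regime, not the zero-label one.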
Circularity Check
No significant circularity; empirical gains measured on external benchmarks
Full rationale
The paper's core contribution is an empirical co-evolutionary procedure that constructs rewards from a passing matrix of self-generated code-test pairs and a pluggable selector. Evaluation occurs on standard benchmarks whose ground-truth tests are external to the training loop, and reported gains (e.g., +14.5% label-free, +21.6% with DyB4) are measured against base-model performance on those held-out tests. No equation or derivation reduces the claimed improvement to a tautological re-labeling of the model's own outputs; the selector's correctness is treated as an empirical assumption rather than a definitional identity. DyB4's use of 10 labeled instances is explicitly separated from the fully label-free regime. The evaluation therefore remains grounded in benchmarks external to the self-supervised loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of labeled instances for DyB4 = 10
axioms (2)
- domain assumption Execution results from self-generated code-test pairs provide a usable signal for identifying correct solutions and discriminative tests
- domain assumption A pluggable selection algorithm can reliably extract a consensus subset without external ground truth
invented entities (2)
- selector drift: no independent evidence
- DyB4: no independent evidence