pith. machine review for the scientific record.

arxiv: 2604.07864 · v1 · submitted 2026-04-09 · 💻 cs.SE

Recognition: no theorem link

ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code generation · LLM · test generation · co-evolution · label-free · reinforcement learning · RLVR · software engineering

The pith

LLMs can improve code generation without ground-truth supervision by co-evolving a coder and tester through their self-generated execution feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ZeroCoder, a framework that trains an LLM to generate both code solutions and tests for programming problems with no human labels at all. It samples multiple candidate solutions and tests per problem, executes them against each other to build a passing matrix, then applies a selection rule to pick a consensus group of likely-correct solutions and useful tests that supply training rewards to both roles. The process repeats, letting the two generators improve each other over time. A reader would care because this removes the bottleneck of collecting expensive human-written test cases for training better coding models. The method also adds a Bayesian selector update called DyB4 that uses only a handful of labels to keep the selection accurate as training progresses.

Core claim

ZeroCoder is a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions. For each problem, it executes sampled solutions against sampled tests to form a passing matrix, identifies a consensus subset of likely-correct solutions and consistent tests via a pluggable selection algorithm, and derives role-specific rewards. Low-information cases are filtered first, the Tester is trained with a curriculum on validity and mutation-driven discriminativeness, and DyB4 recalibrates selection priors dynamically to counter selector drift.

What carries the argument

The passing matrix formed by executing self-generated code solutions against self-generated tests, from which a selection algorithm extracts a consensus subset to supply role-specific rewards for joint training.
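The loop above can be sketched end to end. Everything here is illustrative: the `run` executor, the majority-pattern selection rule, and all names are assumptions for the sketch, not the paper's actual sandbox or its pluggable selector.

```python
import numpy as np

def passing_matrix(solutions, tests, run):
    # M[i, j] = 1 if solution i passes test j; `run` stands in for a
    # hypothetical sandboxed executor returning True on pass.
    M = np.zeros((len(solutions), len(tests)), dtype=int)
    for i, sol in enumerate(solutions):
        for j, test in enumerate(tests):
            M[i, j] = int(run(sol, test))
    return M

def consensus_subset(M):
    # Toy majority-pattern selector: treat each solution's row of
    # pass/fail outcomes as its behavioral signature, take the most
    # common signature as the consensus, and keep the tests that every
    # consensus solution passes. A stand-in for the paper's pluggable
    # selection algorithm, which is more sophisticated.
    patterns = [tuple(row) for row in M]
    consensus = max(set(patterns), key=patterns.count)
    codes = [i for i, p in enumerate(patterns) if p == consensus]
    tests = [j for j in range(M.shape[1]) if all(M[i, j] for i in codes)]
    return codes, tests
```

Role-specific rewards then follow directly: a sampled solution earns reward for landing in the consensus subset, and a sampled test for being consistent with it.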

If this is right

  • Code generation improves by up to 14.5 percent over the base model in the fully label-free setting across three models and six benchmarks.
  • With the DyB4 selector, code generation gains reach 21.6 percent and test generation improves by 24.3 percent, approaching oracle-supervised performance.
  • Both the code generator and the test generator improve through the same co-evolutionary loop.
  • The framework works with pluggable selection algorithms and includes rank-based pre-filtering to maintain reward quality.
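The rank-based pre-filtering in the last bullet admits a minimal sketch. The use of plain matrix rank and the `min_rank` threshold are assumptions here; the paper's actual information criterion may differ.

```python
import numpy as np

def prefilter(passing_matrices, min_rank=2):
    # Keep only problems whose passing matrix carries enough
    # information: if every solution behaves identically on every test
    # (rank <= 1), consensus selection has nothing to discriminate on,
    # so the derived rewards would be noise. `min_rank` is an assumed
    # knob, not a value taken from the paper.
    return [idx for idx, M in enumerate(passing_matrices)
            if np.linalg.matrix_rank(M) >= min_rank]
```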

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods like this could cut the cost of building capable coding models by removing the need for large human-curated test suites.
  • The same co-evolution pattern could be tested on other domains where one model can verify another through execution, such as simple program verification tasks.
  • Selector drift points to a general need for adaptive calibration in any self-supervised reward system that runs for many iterations.

Load-bearing premise

The consensus subset chosen from the passing matrix of self-generated code-test pairs continues to identify correct solutions and good tests even as the models improve together and the selection rules shift.

What would settle it

If the pass rate of the trained coder on a benchmark that has hidden ground-truth solutions shows no gain or a loss relative to the starting base model, the claim of effective label-free improvement would be falsified.

Figures

Figures reproduced from arXiv: 2604.07864 by Kui Liu, Lishui Fan, Mouxiang Chen, Shanping Li, Tingwei Zhu, Xin Xia, Zhongxin Liu.

Figure 1. Overview of the ZeroCoder framework.
Figure 2. Coder prompt template used in our experiments.
Figure 3. Tester prompt template used in our experiments.
Figure 4. A case study demonstrating the advantages of ZeroCoder.
Figure 5. Analysis of the selection noise rate of static selectors.
Figure 6. Sensitivity analysis of ZeroCoder: (a) performance …
Original abstract

Code generation is important in software engineering, and Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm to improve it through execution-based feedback. However, most RLVR pipelines rely on human-curated tests, making progress bottlenecked by scarce and costly supervision. Existing work tried to use self-generated tests to ground rewards, but the lack of discriminative tests constrains the effect due to the sub-optimal performance of the model on test generation. We aim to improve code generation without ground-truth supervision by co-evolving code and test generation, so that their interactions yield progressively more informative supervision. To this end, we present ZeroCoder, a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions. For each problem, ZeroCoder executes sampled solutions against sampled tests to form a passing matrix, identifies a consensus subset of likely-correct solutions and consistent tests via a pluggable selection algorithm, and derives role-specific rewards. To ensure reward quality, ZeroCoder filters low-information instances via rank-based pre-filtering and trains the Tester with a curriculum balancing validity and mutation-driven discriminativeness. We further identify selector drift, the progressive miscalibration of fixed selection rules during co-evolution, and introduce DyB4, a Bayesian selector that uses as few as 10 labeled instances to recalibrate its priors dynamically. Across three models and six benchmarks, ZeroCoder consistently improves code generation and test generation. In the fully label-free setting, it improves code generation by up to 14.5% over the base model on Qwen2.5-Coder-7B-Instruct. With DyB4, the gain reaches 21.6%, while test generation improves by 24.3%, approaching oracle-supervised performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ZeroCoder, a fully label-free co-evolutionary framework for jointly training coder and tester LLMs. For each problem it samples solutions and tests, executes them to build a passing matrix, applies a pluggable selector to extract a consensus subset of likely-correct codes and consistent tests, derives role-specific rewards, applies rank-based pre-filtering, and trains the tester with a validity/mutation curriculum. It identifies selector drift and proposes DyB4 (a Bayesian selector recalibrated with 10 labeled instances) to mitigate it. Experiments across three models and six benchmarks report code-generation gains of up to 14.5% (label-free) and 21.6% (with DyB4) plus 24.3% test-generation improvement, approaching oracle-supervised performance.

Significance. If the selector reliably isolates correct solutions, the work would meaningfully reduce dependence on human-curated tests in RLVR pipelines for code generation. The explicit treatment of selector drift and the multi-model/multi-benchmark evaluation are strengths; the framework is pluggable and the gains are quantified on standard models (e.g., Qwen2.5-Coder-7B-Instruct).

major comments (2)
  1. [§3.2] §3.2 (Consensus Selection): The central claim that the pluggable selector extracts a subset whose members are actually correct on hidden ground-truth tests is load-bearing yet unverified. No direct measurement (e.g., precision of the selected subset against oracle tests on a held-out set) is reported, leaving open the risk that correlated latent bugs produce spurious consensus and are then reinforced by the derived rewards.
  2. [§4.2] §4.2 (DyB4 and Label-Free Setting): DyB4 requires 10 labeled instances for recalibration, so the higher gains (21.6%) are not fully label-free. The manuscript should provide an explicit ablation separating the zero-label regime from the 10-label regime and clarify how this affects the abstract's claim of 'fully label-free' improvement.
minor comments (2)
  1. [§3.1] The passing matrix is introduced without a formal definition or equation (rows = solutions, columns = tests, entry = pass/fail); adding this early would improve readability.
  2. [Results section] Table captions and result paragraphs should state the number of runs, random seeds, and whether statistical significance tests were performed on the reported percentage gains.
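The check major comment 1 asks for is cheap to specify. A minimal sketch, assuming oracle correctness labels are available for a held-out set; the function name and set-of-indices representation are hypothetical.

```python
def selector_quality(selected, oracle_correct):
    # Precision/recall/F1 of the consensus-selected solution subset
    # against hidden ground-truth correctness. `selected` and
    # `oracle_correct` are sets of solution indices; low precision
    # here would signal the spurious-consensus failure mode the
    # referee describes.
    selected, oracle_correct = set(selected), set(oracle_correct)
    tp = len(selected & oracle_correct)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(oracle_correct) if oracle_correct else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```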

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the selector verification and the distinction between label-free and minimally-supervised regimes. We address each major comment below with specific plans for revision.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Consensus Selection): The central claim that the pluggable selector extracts a subset whose members are actually correct on hidden ground-truth tests is load-bearing yet unverified. No direct measurement (e.g., precision of the selected subset against oracle tests on a held-out set) is reported, leaving open the risk that correlated latent bugs produce spurious consensus and are then reinforced by the derived rewards.

    Authors: We agree that direct verification of the selector's precision against oracle tests is important to substantiate the central claim and rule out spurious consensus from correlated bugs. While the reported code-generation gains on benchmarks (where ground-truth tests exist for final evaluation) provide indirect evidence of selector quality, an explicit measurement was not included. In the revised manuscript we will add an analysis reporting precision, recall, and F1-score of the consensus-selected code subset against the hidden ground-truth tests on held-out problems. This will quantify selector reliability and directly address the concern. revision: yes

  2. Referee: [§4.2] §4.2 (DyB4 and Label-Free Setting): DyB4 requires 10 labeled instances for recalibration, so the higher gains (21.6%) are not fully label-free. The manuscript should provide an explicit ablation separating the zero-label regime from the 10-label regime and clarify how this affects the abstract's claim of 'fully label-free' improvement.

    Authors: The abstract and experimental sections already distinguish the two regimes (14.5% fully label-free vs. 21.6% with DyB4). To improve clarity, we will revise the abstract and introduction to state explicitly that the core ZeroCoder framework is label-free and that DyB4 is an optional Bayesian recalibration module using only 10 labeled instances to counter selector drift. We will also add a dedicated ablation comparing zero-label and 10-label performance across all models and benchmarks, including effects on test-generation quality. This will ensure the 'fully label-free' claim is unambiguously tied to the base framework. revision: yes
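As a sketch of what recalibration from roughly 10 labels can look like, a plain Beta-Bernoulli update on the selector's correctness prior suffices. This is an illustration of the general mechanism, not the actual DyB4 algorithm, which is richer than a single conjugate update.

```python
class BetaPrior:
    # Beta(alpha, beta) belief about the probability that a
    # consensus-selected solution is actually correct; starts uniform.
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, labels):
        # Conjugate update from a handful of labeled instances
        # (True = selected solution verified correct). A drifting
        # selector shows up as a falling posterior mean.
        for correct in labels:
            if correct:
                self.alpha += 1.0
            else:
                self.beta += 1.0
        return self

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)
```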

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external benchmarks

full rationale

The paper's core contribution is an empirical co-evolutionary procedure that constructs rewards from a passing matrix of self-generated code-test pairs and a pluggable selector. Evaluation occurs on standard benchmarks whose ground-truth tests are external to the training loop, and reported gains (e.g., +14.5% label-free, +21.6% with DyB4) are measured against base-model performance on those held-out tests. No equation or derivation reduces the claimed improvement to a tautological re-labeling of the model's own outputs; the selector's correctness is treated as an empirical assumption rather than a definitional identity. DyB4's use of 10 labeled instances is explicitly separated from the fully label-free regime. The framework's gains are therefore grounded in external benchmarks rather than in its own outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

Framework rests on domain assumptions about execution feedback quality and the reliability of consensus selection; introduces selector drift and DyB4 as new constructs without external validation.

free parameters (1)
  • number of labeled instances for DyB4 = 10
    Fixed at 10 to recalibrate Bayesian priors; chosen to balance label cost against drift correction.
axioms (2)
  • domain assumption Execution results from self-generated code-test pairs provide a usable signal for identifying correct solutions and discriminative tests
    Invoked when forming the passing matrix and deriving role-specific rewards.
  • domain assumption A pluggable selection algorithm can reliably extract a consensus subset without external ground truth
    Central to reward derivation in the co-evolution loop.
invented entities (2)
  • selector drift no independent evidence
    purpose: Describes progressive miscalibration of fixed selection rules during co-evolution
    Identified as a new phenomenon requiring mitigation.
  • DyB4 no independent evidence
    purpose: Bayesian selector that dynamically recalibrates using a small number of labels
    New component introduced to address selector drift.

pith-pipeline@v0.9.0 · 5642 in / 1634 out tokens · 35552 ms · 2026-05-10T18:32:03.105221+00:00 · methodology

