pith. machine review for the scientific record.

arxiv: 2305.01210 · v3 · submitted 2023-05-02 · 💻 cs.SE · cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 01:58 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.LG
keywords large language models · code generation · program synthesis · benchmark evaluation · functional correctness · automated testing · HumanEval

The pith

Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard benchmarks for LLM code synthesis rely on too few test cases to reliably judge whether generated programs actually work. EvalPlus automatically expands any benchmark dataset with many new inputs produced by a combination of LLM prompting and mutation-based generation. When applied to HumanEval, the resulting HumanEval+ dataset exposes large numbers of programs that pass the original tests yet fail the added ones. Across 26 models this stricter evaluation lowers measured pass@k by up to 19.3-28.9 percent and reverses some model rankings, with certain CodeLlama variants now outperforming ChatGPT. The central message is that earlier performance numbers overstated how correct LLM code really is and that automated test augmentation is needed to correct the record.
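
Concretely, "pass the original tests yet fail the added ones" is decided by executing each generated program against both suites and diffing the verdicts. A minimal sketch of that comparison, assuming an illustrative problem record with candidate source, base tests, and extra tests (the field names and the single entry point named "solution" are assumptions, not the EvalPlus API):

```python
def run_suite(program_src: str, tests: list) -> bool:
    """Execute one candidate program against a list of (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)          # define the candidate function
        fn = namespace["solution"]            # assumed entry-point name
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                          # any runtime error counts as failure

def newly_caught(problems: dict) -> list:
    """Task ids whose candidate passes the base tests but fails once extras are added."""
    caught = []
    for task_id, p in problems.items():
        base_ok = run_suite(p["candidate"], p["base_tests"])
        plus_ok = run_suite(p["candidate"], p["base_tests"] + p["extra_tests"])
        if base_ok and not plus_ok:
            caught.append(task_id)
    return caught
```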

Core claim

EvalPlus augments existing code-synthesis benchmarks by generating large numbers of additional test inputs through LLM-based and mutation-based strategies. Applied to HumanEval, it produces HumanEval+ containing 80 times more tests; when 26 popular LLMs are re-evaluated on this dataset, pass@k rates drop by as much as 19.3-28.9 percent because many previously accepted programs fail on the new cases. The stricter tests also reorder model performance, allowing models such as WizardCoder-CodeLlama and Phind-CodeLlama to surpass ChatGPT where they previously did not.
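
The pass@k figures quoted here follow the standard unbiased estimator introduced with HumanEval (Chen et al., 2021): for a problem with n sampled programs of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch, with illustrative correct-counts rather than the paper's measurements:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers only: 200 samples per problem, correct counts under the
# original tests vs. the stricter augmented tests.
base_correct, plus_correct = [120, 90, 60], [95, 40, 10]
base = sum(pass_at_k(200, c, 10) for c in base_correct) / 3
plus = sum(pass_at_k(200, c, 10) for c in plus_correct) / 3
print(f"pass@10 drops from {base:.3f} to {plus:.3f}")
```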

What carries the argument

EvalPlus, an evaluation framework that augments a benchmark with automatically generated test cases using both LLM prompting and mutation strategies to increase coverage of functional behaviors.
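
The mutation half of that generator can be read as type-aware perturbation of seed inputs, so that new inputs stay well-formed for the task while probing nearby behavior. A minimal sketch of the idea, with illustrative mutation operators rather than EvalPlus's exact ones:

```python
import copy
import random

def mutate(value):
    """Type-aware mutation: perturb a value while keeping its type plausible."""
    if isinstance(value, bool):               # check bool before int (bool subclasses int)
        return not value
    if isinstance(value, int):
        return value + random.choice([-10, -1, 1, 10])
    if isinstance(value, float):
        return value * random.choice([0.5, 2.0, -1.0])
    if isinstance(value, str):
        return value + random.choice(["", " ", "a", value[::-1]])
    if isinstance(value, list):
        out = copy.deepcopy(value)
        if out and random.random() < 0.5:
            out.pop(random.randrange(len(out)))       # shrink the list
        else:
            out.append(mutate(out[0]) if out else 0)  # grow the list
        return out
    return value                               # unknown types pass through unchanged

def extend_inputs(seed_inputs: list, budget: int) -> list:
    """Grow a pool of test inputs by repeatedly mutating randomly chosen seeds."""
    pool = list(seed_inputs)
    while len(pool) < budget:
        args = random.choice(pool)
        pool.append(tuple(mutate(a) for a in args))
    return pool
```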

Load-bearing premise

The new test cases generated by EvalPlus are themselves functionally correct and do not falsely reject valid programs or miss critical edge cases.

What would settle it

Running the HumanEval+ test suite on a collection of independently verified correct reference implementations would settle it: a negligible failure rate would support the added tests' validity, while a non-negligible failure rate would show they are unreliable.
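
Operationally, this check amounts to measuring a false-rejection rate over verified implementations. A minimal sketch, assuming a hypothetical verified_solutions map and an (args, expected) test format; neither is an artifact shipped with the paper:

```python
def false_rejection_rate(verified_solutions: dict, plus_tests: dict) -> float:
    """Fraction of independently verified-correct implementations that the
    augmented test suite wrongly rejects. Both dicts are keyed by task id."""
    rejected = 0
    for task_id, fn in verified_solutions.items():
        for args, expected in plus_tests[task_id]:
            if fn(*args) != expected:
                rejected += 1
                break
    return rejected / len(verified_solutions)

# A rate near zero would support the load-bearing premise above; a non-negligible
# rate would indicate the added tests are unreliable.
```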

read the original abstract

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces EvalPlus, a framework to augment code synthesis benchmarks with large numbers of new test cases generated via LLM-based and mutation-based strategies. Applied to HumanEval, this produces HumanEval+ containing 80x more tests. Evaluation across 26 LLMs shows that pass@k rates drop by up to 19.3-28.9% and that test insufficiency can produce incorrect model rankings on the original benchmark.

Significance. If the results hold, the work is significant because it provides large-scale empirical evidence that popular code synthesis benchmarks underestimate functional errors in LLM-generated code and can distort model comparisons. The open-sourcing of the full tool suite, enhanced datasets, and all generated code is a clear strength that directly supports verification and reuse. The breadth of the evaluation (26 models) adds weight to the central empirical claims.

major comments (1)
  1. [Test input generator and validation] The new test cases are validated only by execution against the original HumanEval reference solutions. This catches some errors but cannot detect tests that are inconsistent with the intended specification or that expose incompletenesses in the reference solutions themselves. Because the headline claim (19.3-28.9% drop in pass@k) rests on these tests being semantically correct, the validation procedure is load-bearing and requires either additional safeguards (e.g., manual review of a random sample or cross-validation against multiple independent solutions) or an explicit discussion of the remaining risk.
minor comments (1)
  1. [Abstract] The abstract states the reduction range as 'up-to 19.3-28.9%' without clarifying whether the bounds correspond to different values of k, different models, or both; a brief parenthetical would improve precision.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: The new test cases are validated only by execution against the original HumanEval reference solutions. This catches some errors but cannot detect tests that are inconsistent with the intended specification or that expose incompletenesses in the reference solutions themselves. Because the headline claim (19.3-28.9% drop in pass@k) rests on these tests being semantically correct, the validation procedure is load-bearing and requires either additional safeguards (e.g., manual review of a random sample or cross-validation against multiple independent solutions) or an explicit discussion of the remaining risk.

    Authors: We appreciate the referee's careful scrutiny of the test validation step. In EvalPlus, each newly generated test input is executed against the original HumanEval reference solution; only inputs that the reference passes are retained. This guarantees that every added test is consistent with the benchmark's reference implementation, which functions as the de facto specification for the task. We agree that this procedure cannot detect tests that might conflict with the natural-language problem description or that would expose incompletenesses in the reference solutions themselves. Although HumanEval is a long-standing and widely trusted benchmark, we recognize that reliance on its references introduces a residual risk. In the revised manuscript we will add an explicit discussion of this limitation in the methodology section. We will also report the results of a manual review of a random sample of 100 generated tests (stratified across problems) to provide supplementary evidence of their semantic correctness. These changes will directly address the concern that the reported pass@k reductions depend on the semantic validity of the new tests. revision: yes
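
The retention rule the authors describe (keep an input only if the reference solution handles it, and treat the reference's output as the expected answer) can be sketched as a simple filter; the helper below is illustrative, not the EvalPlus implementation:

```python
def filter_by_reference(candidate_inputs: list, reference) -> list:
    """Keep only inputs the reference solution executes cleanly on, pairing each
    retained input with the reference's output as the expected answer."""
    kept = []
    for args in candidate_inputs:
        try:
            expected = reference(*args)   # the reference acts as the de facto oracle
        except Exception:
            continue                      # malformed or ill-typed input: discard
        kept.append((args, expected))
    return kept
```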

Circularity Check

0 steps flagged

No significant circularity; empirical execution results are independent of the test-generation inputs

full rationale

The paper's central claim rests on direct execution of LLM-generated code against an expanded test suite (HumanEval+). Test generation uses LLM and mutation strategies, with validation performed by running tests against the original HumanEval reference solutions. This process does not reduce to any self-definitional equivalence, fitted parameter renamed as prediction, or load-bearing self-citation chain. The observed drop in pass@k (19.3-28.9%) is a measured empirical outcome from running the same code on more tests, not a constructed identity with the generation inputs. The framework is grounded in external benchmarks through execution, with no equations or derivations that loop back to the paper's own fitted values or prior self-citations as the sole justification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the test generator produces valid tests that accurately reflect functional correctness without introducing artifacts.

axioms (1)
  • domain assumption The LLM- and mutation-based test input generator produces valid, non-redundant test cases that correctly identify functional errors.
    Invoked in the description of EvalPlus construction and HumanEval+ extension; if false, the reported pass@k reductions would not hold.

pith-pipeline@v0.9.0 · 5657 in / 1128 out tokens · 38710 ms · 2026-05-14T01:58:11.769884+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    cs.CL 2023-10 unverdicted novelty 8.0

    SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.

  2. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% of tests on only 3% of tasks while...

  3. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

  4. When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

  5. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    cs.SE 2024-05 unverdicted novelty 7.0

    SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.

  6. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

  7. Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.

  8. SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...

  9. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE 2026-05 accept novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  10. Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

    cs.SE 2026-04 conditional novelty 6.0

    SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...

  11. You Don't Need Public Tests to Generate Correct Code

    cs.SE 2026-04 unverdicted novelty 6.0

    DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...

  12. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  13. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  14. Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

    cs.SE 2026-04 unverdicted novelty 6.0

    Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

  15. Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping

    cs.DC 2026-04 unverdicted novelty 6.0

    Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy...

  16. Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

    cs.SE 2026-03 unverdicted novelty 6.0

    Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

  17. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  18. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  19. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  20. Evaluating LLM-Generated Code: A Benchmark and Developer Study

    cs.SE 2026-05 unverdicted novelty 5.0

    A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.

  21. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  22. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
