arXiv: 2305.01210 · v3 · submitted 2023-05-02 · 💻 cs.SE · cs.CL · cs.LG

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

Pith reviewed 2026-05-14 01:58 UTC · model grok-4.3

keywords: large language models, code generation, program synthesis, benchmark evaluation, functional correctness, automated testing, HumanEval

The pith

Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard benchmarks for LLM code synthesis rely on too few test cases to reliably judge whether generated programs actually work. EvalPlus automatically expands a benchmark dataset with many new inputs produced by a combination of LLM prompting and mutation-based generation. When applied to HumanEval, the resulting HumanEval+ dataset exposes large numbers of programs that pass the original tests yet fail the added ones. Across 26 models this stricter evaluation lowers measured pass@k by up to 19.3-28.9 percent and reverses some model rankings: WizardCoder-CodeLlama and Phind-CodeLlama outperform ChatGPT on HumanEval+ although neither did on HumanEval. The central message is that earlier performance numbers overstated how correct LLM code really is and that automated test augmentation is needed to correct the record.

Core claim

EvalPlus augments existing code-synthesis benchmarks by generating large numbers of additional test inputs through LLM-based and mutation-based strategies. Applied to HumanEval, it produces HumanEval+ containing 80 times more tests; when 26 popular LLMs are re-evaluated on this dataset, pass@k rates drop by as much as 19.3-28.9 percent because many previously accepted programs fail on the new cases. The stricter tests also reorder model performance, allowing models such as WizardCoder-CodeLlama and Phind-CodeLlama to surpass ChatGPT where they previously did not.
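For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c samples that pass every test, and estimate the chance that at least one of k randomly chosen samples is correct. The sketch below shows that estimator in Python; the sample counts in the usage lines are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples
    c: number of samples that pass all tests
    k: number of samples the user is allowed to try
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that all k drawn samples are incorrect is C(n-c, k) / C(n, k).
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: the same 10 samples, scored against two test suites.
original_suite = pass_at_k(n=10, c=6, k=1)   # 6 samples pass the original tests
augmented_suite = pass_at_k(n=10, c=4, k=1)  # only 4 still pass after augmentation
print(original_suite, augmented_suite)
```

Because an augmented suite that contains the original tests can only lower c, the estimate for a fixed set of samples can stay the same or fall, never rise.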

What carries the argument

EvalPlus, an evaluation framework that augments a benchmark with automatically generated test cases using both LLM prompting and mutation strategies to increase coverage of functional behaviors.
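To make the mutation-based half of this concrete, here is a minimal sketch of growing a pool of test inputs from a few seeds. The function names `mutate` and `generate_inputs` and the specific type-preserving mutation rules are illustrative assumptions; they are not taken from the EvalPlus code base.

```python
import random


def mutate(value, rng):
    """Return a small variation of `value` that keeps its type (hypothetical rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice("abc ")     # append one character
    if isinstance(value, list):
        if not value:
            return [0]                         # grow an empty list
        i = rng.randrange(len(value))
        if rng.random() < 0.5:
            return value[:i] + value[i + 1:]   # remove one element
        return value[:i] + [mutate(value[i], rng)] + value[i + 1:]
    return value  # unknown types are passed through unchanged


def generate_inputs(seeds, n, seed=0):
    """Grow `seeds` (a non-empty list of argument tuples) to at least n inputs."""
    rng = random.Random(seed)
    pool = list(seeds)
    while len(pool) < n:
        parent = rng.choice(pool)  # mutated children can become parents later
        pool.append(tuple(mutate(arg, rng) for arg in parent))
    return pool


inputs = generate_inputs([(1, "ab", [1, 2])], n=20)
print(len(inputs), inputs[-1])
```

In this sketch the seeds stand in for the LLM-produced inputs. Each generated input would still need an expected output, for example by running a reference solution on it.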

Load-bearing premise

The new test cases generated by EvalPlus are themselves functionally correct and do not falsely reject valid programs or miss critical edge cases.

What would settle it

Running the HumanEval+ test suite on a collection of independently verified correct reference implementations and observing a non-negligible failure rate would show the added tests are unreliable.
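The check described above can be sketched as a small harness: run every test against a set of implementations that are trusted to be correct and report how many of them the suite rejects. The helper name `false_rejection_rate` and the toy `abs`-style implementations are assumptions made for illustration.

```python
def false_rejection_rate(implementations, tests):
    """Fraction of trusted implementations that fail at least one test.

    implementations: non-empty list of callables believed to be correct
    tests: list of (args, expected) pairs, where args is a tuple
    """
    if not implementations:
        raise ValueError("need at least one trusted implementation")

    def passes(fn):
        for args, expected in tests:
            try:
                if fn(*args) != expected:
                    return False
            except Exception:
                return False  # a crash counts as a failed test
        return True

    rejected = sum(1 for fn in implementations if not passes(fn))
    return rejected / len(implementations)


# Two independently written, correct absolute-value functions.
trusted = [abs, lambda x: x if x >= 0 else -x]
sound_tests = [((3,), 3), ((-2,), 2)]
faulty_tests = sound_tests + [((-1,), -1)]  # deliberately wrong expected value

print(false_rejection_rate(trusted, sound_tests))
print(false_rejection_rate(trusted, faulty_tests))
```

A rate clearly above zero on implementations that were verified independently of the benchmark's own reference solution would point at faulty tests rather than faulty programs.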

Original abstract

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvalPlus shows that HumanEval lets too much incorrect LLM code pass and offers a workable way to expand the test suite 80x, which lowers pass@k by up to 19.3-28.9% and flips some rankings. The new tests, however, rest on a thin validation step.

The main thing here is that standard benchmarks for LLM code generation are too easy, and this paper supplies a practical fix by automatically expanding test suites with a mix of LLM prompting and mutation. They apply it to HumanEval to create HumanEval+, run it on 26 models, and report clear drops in pass@k plus a couple of ranking reversals where models like WizardCoder now look stronger than ChatGPT. The open-sourcing of the full tool chain, datasets, and generated code is the strongest part; it lets anyone reproduce the numbers without extra work. The empirical pattern holds across the model set, which makes the central claim believable on its own terms.

The soft spot is exactly the one the stress-test flags: new tests are accepted if they pass the original reference solutions. That filters out some bad cases but cannot catch tests that are inconsistent with the problem spec or that exploit gaps in the reference. Without a sample of manual review or an independent oracle, a fraction of faulty tests could inflate the reported drops. The paper treats this as a minor engineering detail rather than a load-bearing assumption.

This is useful reading for anyone building or evaluating code-generation systems; the method is general enough to apply elsewhere. It deserves peer review because the data and code are there to check, even if reviewers will want more on test quality.

Referee Report

1 major / 1 minor

Summary. The paper introduces EvalPlus, a framework to augment code synthesis benchmarks with large numbers of new test cases generated via LLM-based and mutation-based strategies. Applied to HumanEval, this produces HumanEval+ containing 80x more tests. Evaluation across 26 LLMs shows that pass@k rates drop by 19.3-28.9% and that test insufficiency can produce incorrect model rankings on the original benchmark.

Significance. If the results hold, the work is significant because it provides large-scale empirical evidence that popular code synthesis benchmarks underestimate functional errors in LLM-generated code and can distort model comparisons. The open-sourcing of the full tool suite, enhanced datasets, and all generated code is a clear strength that directly supports verification and reuse. The breadth of the evaluation (26 models) adds weight to the central empirical claims.

major comments (1)

[Test input generator and validation] The new test cases are validated only by execution against the original HumanEval reference solutions. This catches some errors but cannot detect tests that are inconsistent with the intended specification or that expose incompletenesses in the reference solutions themselves. Because the headline claim (19.3-28.9% drop in pass@k) rests on these tests being semantically correct, the validation procedure is load-bearing and requires either additional safeguards (e.g., manual review of a random sample or cross-validation against multiple independent solutions) or an explicit discussion of the remaining risk.
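The cross-validation safeguard suggested in this comment could look roughly like the sketch below: a generated input is kept only when several independently written solutions return the same value, and every disagreement is set aside for manual inspection. The function name `cross_validate_inputs` and the two toy "median" solutions are hypothetical.

```python
def cross_validate_inputs(solutions, inputs):
    """Split inputs into those all solutions agree on and those in dispute.

    solutions: non-empty list of independently written callables
    inputs: list of argument tuples
    Returns (agreed, disputed), where agreed holds (args, expected) pairs.
    """
    if not solutions:
        raise ValueError("need at least one solution")
    agreed, disputed = [], []
    for args in inputs:
        outcomes = []
        for fn in solutions:
            try:
                outcomes.append(("ok", fn(*args)))
            except Exception as exc:
                outcomes.append(("error", type(exc).__name__))
        first = outcomes[0]
        if first[0] == "ok" and all(o == first for o in outcomes):
            agreed.append((args, first[1]))  # shared output becomes the expected value
        else:
            disputed.append(args)            # disagreement or crash: review by hand
    return agreed, disputed


# Two plausible readings of "median" that differ on even-length lists.
upper_median = lambda xs: sorted(xs)[len(xs) // 2]
lower_median = lambda xs: sorted(xs)[(len(xs) - 1) // 2]

agreed, disputed = cross_validate_inputs(
    [upper_median, lower_median],
    [([1, 2, 3],), ([1, 2],), ([],)],
)
print(agreed)
print(disputed)
```

Disputed inputs are the interesting ones: they mark places where the task description is ambiguous or where one solution has a gap, which is the case that validation against a single reference cannot reveal.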

minor comments (1)

[Abstract] The abstract states the reduction range as 'up-to 19.3-28.9%' without clarifying whether the bounds correspond to different values of k, different models, or both; a brief parenthetical would improve precision.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: The new test cases are validated only by execution against the original HumanEval reference solutions. This catches some errors but cannot detect tests that are inconsistent with the intended specification or that expose incompletenesses in the reference solutions themselves. Because the headline claim (19.3-28.9% drop in pass@k) rests on these tests being semantically correct, the validation procedure is load-bearing and requires either additional safeguards (e.g., manual review of a random sample or cross-validation against multiple independent solutions) or an explicit discussion of the remaining risk.

Authors: We appreciate the referee's careful scrutiny of the test validation step. In EvalPlus, each newly generated test input is executed against the original HumanEval reference solution; only inputs that the reference passes are retained. This guarantees that every added test is consistent with the benchmark's reference implementation, which functions as the de facto specification for the task. We agree that this procedure cannot detect tests that might conflict with the natural-language problem description or that would expose incompletenesses in the reference solutions themselves. Although HumanEval is a long-standing and widely trusted benchmark, we recognize that reliance on its references introduces a residual risk. In the revised manuscript we will add an explicit discussion of this limitation in the methodology section. We will also report the results of a manual review of a random sample of 100 generated tests (stratified across problems) to provide supplementary evidence of their semantic correctness. These changes will directly address the concern that the reported pass@k reductions depend on the semantic validity of the new tests.

revision: yes
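A stratified sample of the kind mentioned in this response could be drawn along the following lines. This is only a sketch: the function name `stratified_sample`, the round-robin allocation that gives each problem a near-equal share, and the toy problem identifiers are assumptions, since the response does not say how the strata would be weighted.

```python
import random


def stratified_sample(tests_by_problem, total, seed=0):
    """Draw up to `total` tests, taking them round-robin across problems.

    tests_by_problem: dict mapping a problem id to its list of generated tests
    Returns a list of (problem_id, test) pairs without repeats.
    """
    rng = random.Random(seed)
    # Shuffle each problem's tests once so the picks within a stratum are random.
    shuffled = {
        pid: rng.sample(tests, len(tests))
        for pid, tests in sorted(tests_by_problem.items(), key=lambda item: item[0])
    }
    sample = []
    round_idx = 0
    while len(sample) < total:
        added = False
        for pid, tests in shuffled.items():
            if round_idx < len(tests) and len(sample) < total:
                sample.append((pid, tests[round_idx]))
                added = True
        if not added:
            break  # every problem is exhausted before reaching `total`
        round_idx += 1
    return sample


# Toy data: three problems with different numbers of generated tests.
toy = {"p1": list(range(10)), "p2": list(range(3)), "p3": list(range(10))}
picked = stratified_sample(toy, total=9)
print(picked)
```

Equal allocation keeps problems with very large generated suites from dominating the review; weighting by suite size would be the other obvious choice and answers a slightly different question.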

Circularity Check

0 steps flagged

No significant circularity; empirical execution results are independent of inputs

The paper's central claim rests on direct execution of LLM-generated code against an expanded test suite (HumanEval+). Test generation uses LLM and mutation strategies, with validation performed by running tests against the original HumanEval reference solutions. This process does not reduce to any self-definitional equivalence, fitted parameter renamed as prediction, or load-bearing self-citation chain. The observed drop in pass@k (19.3-28.9%) is a measured empirical outcome from running the same code on more tests, not a constructed identity with the generation inputs. The framework is self-contained against external benchmarks via execution, with no equations or derivations that loop back to the paper's own fitted values or prior self-citations as the sole justification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the test generator produces valid tests that accurately reflect functional correctness without introducing artifacts.

axioms (1)

domain assumption: The LLM- and mutation-based test input generator produces valid, non-redundant test cases that correctly identify functional errors.
Invoked in the description of EvalPlus construction and HumanEval+ extension; if false, the reported pass@k reductions would not hold.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
cs.CL 2023-10 unverdicted novelty 8.0

SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
ProgramBench: Can Language Models Rebuild Programs From Scratch?
cs.SE 2026-05 unverdicted novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
cs.SE 2024-05 unverdicted novelty 7.0

SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
cs.SE 2026-04 conditional novelty 6.0

SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...
You Don't Need Public Tests to Generate Correct Code
cs.SE 2026-04 unverdicted novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
cs.SE 2026-04 unverdicted novelty 6.0

Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code
cs.SE 2026-04 unverdicted novelty 6.0

Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping
cs.DC 2026-04 unverdicted novelty 6.0

Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy...
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
cs.SE 2026-03 unverdicted novelty 6.0

Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Textbooks Are All You Need
cs.CL 2023-06 unverdicted novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
cs.AI 2023-03 conditional novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Evaluating LLM-Generated Code: A Benchmark and Developer Study
cs.SE 2026-05 unverdicted novelty 5.0

A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
