Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Pith reviewed 2026-05-14 01:58 UTC · model grok-4.3
The pith
Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvalPlus augments existing code-synthesis benchmarks by generating large numbers of additional test inputs through LLM-based and mutation-based strategies. Applied to HumanEval, it produces HumanEval+ containing 80 times more tests; when 26 popular LLMs are re-evaluated on this dataset, pass@k rates drop by as much as 19.3-28.9 percent because many previously accepted programs fail on the new cases. The stricter tests also reorder model performance, allowing models such as WizardCoder-CodeLlama and Phind-CodeLlama to surpass ChatGPT where they previously did not.
What carries the argument
EvalPlus, an evaluation framework that augments a benchmark with automatically generated test cases using both LLM prompting and mutation strategies to increase coverage of functional behaviors.
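The mutation-based half of this idea can be sketched in a few lines. This is a minimal illustration, not EvalPlus's actual generator: the function names `mutate` and `grow_inputs` and the specific mutation rules are assumptions made for this example. The only point is that each mutation keeps the type of the seed input, so the new inputs remain plausible arguments for the function under test.

```python
import random


def mutate(value, rng):
    """Return a variant of `value` with the same type (illustrative rules only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.choice([-1.0, 1.0])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # delete one character
        return value + rng.choice("abc")  # append one character
    if isinstance(value, list):
        if not value:
            return [0]  # grow an empty list
        i = rng.randrange(len(value))
        if rng.random() < 0.5:
            return value[:i] + value[i + 1:]  # delete one element
        return value[:i] + [mutate(value[i], rng)] + value[i + 1:]  # mutate one element
    return value  # unknown types are left unchanged


def grow_inputs(seeds, rounds, seed=0):
    """Start from seed inputs and repeatedly mutate a randomly chosen pool member."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        pool.append(mutate(rng.choice(pool), rng))
    return pool


# Hypothetical seed inputs; in the paper's setting the seeds come from LLM prompting.
pool = grow_inputs([[1, 2, 3], "ab", 5], rounds=20)
```

Because mutated inputs are added back to the pool, later rounds can mutate earlier mutants, which is how a small seed set can grow into a much larger one.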
Load-bearing premise
The new test cases generated by EvalPlus are themselves functionally correct and do not falsely reject valid programs or miss critical edge cases.
What would settle it
Running the HumanEval+ test suite on a collection of independently verified correct reference implementations and observing a non-negligible failure rate would show the added tests are unreliable.
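The check described above can be sketched as a small measurement. The helper name `false_rejection_rate` and the toy absolute-value task are hypothetical, chosen only to show what a "non-negligible failure rate" on known-correct code would look like.

```python
def false_rejection_rate(implementations, tests):
    """Fraction of known-correct implementations rejected by at least one test.

    `tests` is a list of (args, expected) pairs.
    """
    rejected = 0
    for impl in implementations:
        for args, expected in tests:
            try:
                ok = impl(*args) == expected
            except Exception:
                ok = False  # a crash counts as a failed test
            if not ok:
                rejected += 1
                break  # one failing test is enough to reject this implementation
    return rejected / len(implementations)


# Two independently written, correct implementations of absolute value.
def abs_a(x):
    return x if x >= 0 else -x


def abs_b(x):
    return abs(x)


good_tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
bad_tests = good_tests + [((-1,), -1)]  # deliberately wrong expected value

rate_good = false_rejection_rate([abs_a, abs_b], good_tests)
rate_bad = false_rejection_rate([abs_a, abs_b], bad_tests)
```

A sound suite should give a rate of zero on verified implementations; a single wrong expected value, as in `bad_tests`, rejects every correct implementation.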
Original abstract
Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvalPlus, a framework to augment code synthesis benchmarks with large numbers of new test cases generated via LLM-based and mutation-based strategies. Applied to HumanEval, this produces HumanEval+ containing 80x more tests. Evaluation across 26 LLMs shows that pass@k rates drop by up to 19.3-28.9% and that test insufficiency can produce incorrect model rankings on the original benchmark.
Significance. If the results hold, the work is significant because it provides large-scale empirical evidence that popular code synthesis benchmarks underestimate functional errors in LLM-generated code and can distort model comparisons. The open-sourcing of the full tool suite, enhanced datasets, and all generated code is a clear strength that directly supports verification and reuse. The breadth of the evaluation (26 models) adds weight to the central empirical claims.
major comments (1)
- [Test input generator and validation] The new test cases are validated only by execution against the original HumanEval reference solutions. This catches some errors but cannot detect tests that are inconsistent with the intended specification or that expose incompletenesses in the reference solutions themselves. Because the headline claim (19.3-28.9% drop in pass@k) rests on these tests being semantically correct, the validation procedure is load-bearing and requires either additional safeguards (e.g., manual review of a random sample or cross-validation against multiple independent solutions) or an explicit discussion of the remaining risk.
minor comments (1)
- [Abstract] The abstract states the reduction range as 'up-to 19.3-28.9%' without clarifying whether the bounds correspond to different values of k, different models, or both; a brief parenthetical would improve precision.
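To see why the value of k matters for reading the reported range, it helps to look at how pass@k is usually computed. The sketch below uses the unbiased estimator from Chen et al. (2021), pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; that the reviewed paper uses exactly this estimator is an assumption here, and the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 3 pass the original tests,
# but only 1 still passes after stricter tests are added.
before_k1 = pass_at_k(10, 3, 1)
after_k1 = pass_at_k(10, 1, 1)
before_k5 = pass_at_k(10, 3, 5)
after_k5 = pass_at_k(10, 1, 5)

drop_k1 = before_k1 - after_k1
drop_k5 = before_k5 - after_k5
```

With these made-up counts the same loss of two correct samples produces a drop of 0.2 at k=1 and about 0.42 at k=5, so a range of reductions can arise from varying k alone, even for a single model.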
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address the major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
-
Referee: The new test cases are validated only by execution against the original HumanEval reference solutions. This catches some errors but cannot detect tests that are inconsistent with the intended specification or that expose incompletenesses in the reference solutions themselves. Because the headline claim (19.3-28.9% drop in pass@k) rests on these tests being semantically correct, the validation procedure is load-bearing and requires either additional safeguards (e.g., manual review of a random sample or cross-validation against multiple independent solutions) or an explicit discussion of the remaining risk.
Authors: We appreciate the referee's careful scrutiny of the test validation step. In EvalPlus, each newly generated test input is executed against the original HumanEval reference solution; only inputs that the reference passes are retained. This guarantees that every added test is consistent with the benchmark's reference implementation, which functions as the de facto specification for the task. We agree that this procedure cannot detect tests that might conflict with the natural-language problem description or that would expose incompletenesses in the reference solutions themselves. Although HumanEval is a long-standing and widely trusted benchmark, we recognize that reliance on its references introduces a residual risk. In the revised manuscript we will add an explicit discussion of this limitation in the methodology section. We will also report the results of a manual review of a random sample of 100 generated tests (stratified across problems) to provide supplementary evidence of their semantic correctness. These changes will directly address the concern that the reported pass@k reductions depend on the semantic validity of the new tests.
revision: yes
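The exchange above turns on the difference between using the reference solution as the oracle and cross-validating against independent solutions. A minimal sketch of both, under stated assumptions: `label_with_reference` and `disputed_inputs` are hypothetical helpers, and the flawed reference is invented to show the failure mode the referee describes.

```python
def label_with_reference(reference, inputs):
    """Use the reference solution's output as the expected value (the oracle)."""
    return [(args, reference(*args)) for args in inputs]


def disputed_inputs(reference, independent_solutions, inputs):
    """Inputs on which a majority of independent solutions disagree with the reference."""
    flagged = []
    for args in inputs:
        expected = reference(*args)
        outputs = [solve(*args) for solve in independent_solutions]
        disagreements = sum(1 for out in outputs if out != expected)
        if disagreements * 2 > len(outputs):
            flagged.append(args)
    return flagged


# Hypothetical task: absolute difference of two integers.
def flawed_reference(a, b):
    return a - b  # invented bug: forgets the absolute value


def alt_one(a, b):
    return abs(a - b)


def alt_two(a, b):
    return max(a, b) - min(a, b)


inputs = [(5, 3), (3, 5), (2, 2)]
tests = label_with_reference(flawed_reference, inputs)
disputed = disputed_inputs(flawed_reference, [alt_one, alt_two], inputs)
```

Every test in `tests` passes against the reference by construction, yet the test for `(3, 5)` expects `-2`. Only the comparison with independent solutions surfaces that input for manual review.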
Circularity Check
No significant circularity; empirical execution results are independent of inputs
full rationale
The paper's central claim rests on direct execution of LLM-generated code against an expanded test suite (HumanEval+). Test generation uses LLM and mutation strategies, with validation performed by running tests against the original HumanEval reference solutions. This process does not reduce to any self-definitional equivalence, fitted parameter renamed as prediction, or load-bearing self-citation chain. The observed drop in pass@k (19.3-28.9%) is a measured empirical outcome from running the same code on more tests, not a constructed identity with the generation inputs. The framework is self-contained against external benchmarks via execution, with no equations or derivations that loop back to the paper's own fitted values or prior self-citations as the sole justification.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the LLM- and mutation-based test input generator produces valid, non-redundant test cases that correctly identify functional errors.
Forward citations
Cited by 22 Pith papers
-
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
-
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...
-
You Don't Need Public Tests to Generate Correct Code
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
-
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
-
Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping
Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy...
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
Evaluating LLM-Generated Code: A Benchmark and Developer Study
A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Reference graph
Works this paper leans on
- [1] T. Ahmed and P. Devanbu. Few-shot training llms for project-specific code-summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–5, 2022
- [2]
- [3]
- [4] S. Bang, S. Nam, I. Chun, H. Y. Jhoo, and J. Lee. Smt-based translation validation for machine learning compiler. In Computer Aided Verification: 34th International Conference, CAV 2022, Haifa, Israel, August 7–10, 2022, Proceedings, Part II, pages 386–407. Springer, 2022
- [5]
- [6]
- [7] T. A. Budd. Mutation analysis of program test data. Yale University, 1980
- [8]
- [9] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023
- [10] S. K. Cha, M. Woo, and D. Brumley. Program-adaptive mutational fuzzing. In 2015 IEEE Symposium on Security and Privacy, pages 725–741. IEEE, 2015
- [11] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
- [12]
- [13] B. Code. Starcoder. https://github.com/bigcode-project/starcoder, 2023
- [14] Y. Deng, C. S. Xia, H. Peng, C. Yang, and L. Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In 32nd International Symposium on Software Testing and Analysis (ISSTA), 2023
- [15] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang. Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt. In 46th International Conference on Software Engineering (ICSE), 2024
- [16] fauxpilot. Fauxpilot: an open-source alternative to github copilot server. https://github.com/fauxpilot/fauxpilot, 2022
- [17] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998
- [18]
- [19] C. Green. Application of theorem proving to problem solving. In Readings in Artificial Intelligence, pages 202–222. Elsevier, 1981
- [20] S. Gulwani. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, Jan. 2011
- [21] S. Gulwani, O. Polozov, and R. Singh. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017
- [22]
- [23]
- [24] M. Ivanković, G. Petrović, R. Just, and G. Fraser. Code coverage at google. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 955–963, 2019
- [25] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1643–1652, 2018
- [26] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023
- [27]
- [28] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
- [29]
- [30] J. C. King. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, 1976
- [31] M.-A. Lachaux, B. Roziere, L. Chanussot, and G. Lample. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511, 2020
- [32] K. R. M. Leino. Dafny: An automatic program verifier for functional correctness. In International conference on logic for programming artificial intelligence and reasoning, pages 348–370. Springer, 2010
- [33] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022
- [34] J. Liu, Y. Wei, S. Yang, Y. Deng, and L. Zhang. Coverage-guided tensor compiler fuzzing with joint ir-pass mutation. Proceedings of the ACM on Programming Languages, 6(OOPSLA1):1–26, Apr. 2022
- [35] Z. Liu, C. Chen, J. Wang, X. Che, Y. Huang, J. Hu, and Q. Wang. Fill in the blank: Context-aware automated text input generation for mobile gui testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1355–1367. IEEE, 2023
- [36] N. P. Lopes, J. Lee, C.-K. Hur, Z. Liu, and J. Regehr. Alive2: bounded translation validation for llvm. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 65–79, 2021
- [37]
- [38]
- [39] Z. Manna and R. J. Waldinger. Toward automatic program synthesis. Communications of the ACM, 14(3):151–165, 1971
- [40] W. M. McKeeman. Differential testing for software. Digital Technical Journal, 10(1):100–107, 1998
- [41] B. Meyer. Applying 'design by contract'. Computer, 25(10):40–51, 1992
- [42] Microsoft. GitHub Copilot – Your AI pair programmer. https://github.com/features/copilot, 2023
- [43] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of unix utilities. Communications of the ACM, 33(12):32–44, 1990
- [44] G. C. Necula. Translation validation for an optimizing compiler. In Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation, pages 83–94, 2000
- [45] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint, 2023
- [46] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023
- [47] P. Oehlert. Violating assumptions with fuzzing. IEEE Security & Privacy, 3(2):58–62, 2005
- [48] OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022
- [49] OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023
- [50] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
- [51] G. Petrović and M. Ivanković. State of mutation testing at google. In Proceedings of the 40th international conference on software engineering: Software engineering in practice, pages 163–171, 2018
- [52] Phind. Phind/phind-codellama-34b-v2 · hugging face. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2, 2023
- [53] G. Rothermel, M. J. Harrold, J. Von Ronne, and C. Hong. Empirical studies of test-suite reduction. Software Testing, Verification and Reliability, 12(4):219–249, 2002
- [54] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023
- [55] B. Roziere, J. M. Zhang, F. Charton, M. Harman, G. Synnaeve, and G. Lample. Leveraging automated unit tests for unsupervised code translation. arXiv preprint arXiv:2110.06773, 2021
- [56]
- [57] K. Serebryany. Continuous fuzzing with libfuzzer and addresssanitizer. In 2016 IEEE Cybersecurity Development (SecDev), pages 157–157. IEEE, 2016
- [58] D. E. Shaw, W. R. Swartout, and C. C. Green. Inferring lisp programs from examples. In IJCAI, volume 75, pages 260–267, 1975
- [59] A. Shi, A. Gyori, M. Gligoric, A. Zaytsev, and D. Marinov. Balancing trade-offs in test-suite reduction. In Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, pages 246–256, 2014
- [60] Stability-AI. Stablelm: Stability ai language models. https://github.com/Stability-AI/StableLM, 2023
- [61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
- [62] R. J. Waldinger and R. C. Lee. Prow: A step toward automatic program writing. In Proceedings of the 1st international joint conference on Artificial intelligence, pages 241–252, 1969
- [63] B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021
- [64]
- [65]
- [66] D. Winterer, C. Zhang, and Z. Su. On the unusual effectiveness of type-aware operator mutations for testing smt solvers. Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–25, 2020
- [67] C. S. Xia, M. Paltenghi, J. L. Tian, M. Pradel, and L. Zhang. Universal fuzzing via large language models. In 46th International Conference on Software Engineering (ICSE), 2024
- [68] C. S. Xia, Y. Wei, and L. Zhang. Automated program repair in the era of large pre-trained language models. In Proceedings of the 45th International Conference on Software Engineering (ICSE 2023). Association for Computing Machinery, 2023
- [69] C. S. Xia and L. Zhang. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 959–971, 2022
- [70] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10, 2022
- [71] C. Yang, Y. Deng, R. Lu, J. Yao, J. Liu, R. Jabbarvand, and L. Zhang. White-box compiler fuzzing empowered by large language models, 2023
- [72] X. Yang, Y. Chen, E. Eide, and J. Regehr. Finding and understanding bugs in c compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, page 283–294, New York, NY, USA, 2011. Association for Computing Machinery
- [73] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
- [74]
- [75]
- [76] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, et al. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint arXiv:2303.17568, 2023
Table 5: Overview of evaluated models.
Model Name | Sizes | Release Year | Open-Source
Coding
CodeGen [46] | 2B, 6B, 16B | 2022 | ✓
INC...