pith. machine review for the scientific record.

arxiv: 2605.09059 · v1 · submitted 2026-05-09 · 💻 cs.SE

Recognition: no theorem link

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM code generation, code evaluation, benchmarks, developer study, code quality, production readiness, software engineering, LLM evaluation

The pith

Developer reviews uncover production-readiness issues in LLM code that standard correctness benchmarks overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a three-fold evaluation method for code generated by large language models. It pairs a custom correctness benchmark built around a complex multi-level computer science project with code-quality checks and structured developer surveys. The authors apply this method to compare three models and find that the developer reviews surface issues of maintainability and real-world usability that the benchmark alone cannot detect.

Core claim

A three-fold methodology that combines a dedicated correctness benchmark on a complex project, code quality verification, and developer opinions gathered through structured code reviews provides a fuller picture of LLM-generated code than correctness-focused benchmarks. When used on GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4, the reviews produced additional findings on whether the code reaches a production-ready state.

What carries the argument

Three-fold evaluation methodology integrating a custom correctness benchmark, code quality verification, and structured developer code-review surveys.
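
As a rough illustration of the kind of per-model report such a three-fold methodology could produce, the sketch below combines the three folds into one summary. This is not the authors' implementation; the data structure, field names, and example numbers are hypothetical.

```python
# Minimal sketch (not the paper's code) of collecting the three folds per model:
# correctness on benchmark tasks, static quality findings, and developer-review themes.
from dataclasses import dataclass, field


@dataclass
class ModelEvaluation:
    model: str
    # Fold 1: correctness benchmark -- tasks whose generated solution passes its tests.
    tasks_passed: int = 0
    tasks_total: int = 0
    # Fold 2: code quality verification -- e.g. issue counts from a static analyser,
    # keyed by severity (keys here are illustrative).
    quality_issues: dict = field(default_factory=dict)
    # Fold 3: structured developer reviews -- recurring themes extracted from survey
    # responses (e.g. "error handling missing", "hard to extend").
    review_themes: list = field(default_factory=list)

    @property
    def correctness(self) -> float:
        return self.tasks_passed / self.tasks_total if self.tasks_total else 0.0


def summarize(evaluations: list[ModelEvaluation]) -> None:
    """Print a side-by-side summary of the three folds for each model."""
    for e in evaluations:
        print(f"{e.model}: correctness={e.correctness:.0%}, "
              f"quality issues={sum(e.quality_issues.values())}, "
              f"review themes={len(set(e.review_themes))}")


# Hypothetical usage with made-up numbers, only to show the shape of the report.
summarize([
    ModelEvaluation("gpt-4.1", tasks_passed=18, tasks_total=20,
                    quality_issues={"major": 2, "minor": 7},
                    review_themes=["missing input validation", "unclear naming"]),
])
```

In the paper's terms, the third field is where the production-readiness findings that the correctness benchmark cannot detect would surface.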

Load-bearing premise

That feedback collected from developers in a structured review process gives reliable, generalizable information about production readiness that benchmarks miss.

What would settle it

Repeating the developer reviews on the same code samples with new reviewers and finding that they consistently identify no additional production-readiness problems beyond the benchmark results.

read the original abstract

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model. However, they primarily focus on measuring solution correctness, leaving other aspects, such as code quality and usability, behind. This paper aims to describe a custom three-fold evaluation methodology for code generated by Large Language Models that bridges this gap. The methodology includes a dedicated correctness benchmark based on a complex multi-level computer science project, code quality verification, and a survey of developers' opinions on generated code samples gathered through a structured code-review process. The proposed methodology's usage and usefulness are demonstrated by evaluating and comparing three general-purpose Large Language Models: GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4. The results show that reviews gathered from developers can yield many new findings, especially those related to the code being in a production-ready state, that would not be possible to obtain using the standard correctness-focused benchmark approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a three-fold methodology for evaluating LLM-generated code: (1) a correctness benchmark built on a complex multi-level computer science project, (2) code quality verification, and (3) structured developer reviews collected via a survey process. The authors apply the methodology to compare three general-purpose LLMs (GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4) and conclude that the developer reviews surface production-readiness insights (e.g., maintainability, deployment concerns) that standard correctness-only benchmarks miss.

Significance. If the concrete examples of additional findings hold, the work is significant for highlighting limitations of purely automated correctness benchmarks in code generation. The new benchmark on a complex project and the explicit inclusion of human developer feedback address a recognized gap in the field, potentially informing more holistic evaluation frameworks. The demonstration-style results provide practical evidence that developer input can reveal usability and production aspects not captured by pass@k or similar metrics.
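
For context on the correctness-only metrics the referee contrasts with developer feedback, pass@k is typically computed with the unbiased estimator used for HumanEval-style benchmarks (cf. reference [3]); a minimal sketch follows, with the example numbers invented here.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: the probability that at least one
    of k samples, drawn without replacement from n generated solutions of which
    c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing solution
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Example: 3 of 10 generations pass the task's tests.
print(pass_at_k(n=10, c=3, k=1))  # ~0.30
```

A score like this says nothing about maintainability or deployment concerns, which is precisely where the developer reviews are claimed to add value.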

major comments (1)
  1. [Results] The claim that developer reviews 'yield many new findings' on production readiness rests on the presentation of specific examples; without a reported number of reviewers, inter-rater agreement, or exclusion criteria, it is difficult to assess whether the additional insights are robust or idiosyncratic to the small set of reviewed samples.
minor comments (2)
  1. [Methodology] The methodology description would be strengthened by an appendix containing the exact survey instrument and code-review template used with developers.
  2. [Results] A table or figure summarizing per-model correctness scores, quality metrics, and review themes side by side would improve the readability of the comparative results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the major comment below and will revise the manuscript to provide the requested methodological details.

read point-by-point responses
  1. Referee: [Results] The claim that developer reviews 'yield many new findings' on production readiness rests on the presentation of specific examples; without a reported number of reviewers, inter-rater agreement, or exclusion criteria, it is difficult to assess whether the additional insights are robust or idiosyncratic to the small set of reviewed samples.

    Authors: We agree that the current description of the developer review component lacks sufficient detail for readers to fully evaluate the robustness of the reported insights. In the revised manuscript, we will expand the relevant sections to report: the exact number of developers who participated in the structured code-review process, their professional backgrounds and selection criteria, any exclusion criteria applied to code samples or individual responses, and the protocol used to synthesize recurring themes from the qualitative feedback. We will also clarify that the reviews consisted of independent structured assessments rather than paired quantitative ratings, which is why inter-rater agreement metrics were not computed; instead, we will describe how common production-readiness concerns were identified across responses. These additions will make explicit that the examples are drawn from a defined process and are intended to illustrate gaps missed by correctness benchmarks, rather than to claim statistical generalizability. · revision: yes
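
If a revised protocol did collect paired categorical verdicts from multiple reviewers on the same samples, agreement could be quantified with a standard statistic such as Cohen's kappa. The sketch below is purely illustrative of that analysis and is not part of the paper's reported results; the labels, sample count, and the "ready"/"not ready" scale are invented here.

```python
# Illustrative only: Cohen's kappa for two reviewers who independently label
# the same code samples with a categorical production-readiness verdict.
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0


# Hypothetical example: two reviewers, five code samples.
print(cohens_kappa(["ready", "not ready", "ready", "ready", "not ready"],
                   ["ready", "not ready", "not ready", "ready", "not ready"]))  # ~0.62
```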

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical three-fold methodology (correctness benchmark on a complex project, code quality checks, and structured developer reviews) to evaluate LLM-generated code and demonstrate that reviews surface production-readiness insights missed by standard benchmarks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the described approach or results. The central claim is supported directly by the concrete findings from applying the methodology to GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4, without any reduction to definitional inputs or prior self-referential results. The study is self-contained as a proof-of-concept demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that developer reviews add unique production-readiness information unavailable from automated benchmarks; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Developer opinions obtained through a structured code-review process provide reliable signals about production readiness that automated correctness and quality metrics miss.
    This assumption underpins the third fold and the claim that new findings emerge from the survey.

pith-pipeline@v0.9.0 · 5476 in / 1155 out tokens · 43645 ms · 2026-05-12T02:22:34.681743+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 5 internal anchors

  1. [1] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Su...

  2. [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

  3. [3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4] DeepSeek. 2025. The Temperature Parameter | DeepSeek API Docs. https://api-docs.deepseek.com/quick_start/parameter_settings

  5. [5] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE '24). Association for Computing Machinery, New York...

  6. [6] Eitan Farchi, Shmulik Froimovich, Rami Katan, and Orna Raz. 2024. Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks. arXiv:2410.21071 [cs.SE] https://arxiv.org/abs/2410.21071

  7. [7] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. arXiv:1808.09588 [cs.CL] https://arxiv.org/abs/1808.09588

  8. [8] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

  9. [9] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv.org/abs/2305.01210

  10. [10] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv:2306.03091 [cs.CL] https://arxiv.org/abs/2306.03091

  11. [11] Bradley McDanel and Ed Novak. 2025. Designing LLM-Resistant Programming Assignments: Insights and Strategies for CS Educators. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 (Pittsburgh, PA, USA) (SIGCSE TS 2025). Association for Computing Machinery, New York, NY, USA, 756–762. doi:10.1145/3641554.3701872

  12. [12] Tanha Miah and Hong Zhu. 2024. User Centric Evaluation of Code Generation Tools (Invited Paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest). 109–119. doi:10.1109/AITest62860.2024.00022

  13. [13] SonarSource. 2025. Homepage | SonarQube Cloud | Sonar Documentation. https://docs.sonarsource.com/sonarqube-cloud

  14. [14] SonarSource. 2025. Software qualities | SonarQube Cloud Documentation. https://docs.sonarsource.com/sonarqube-cloud/digging-deeper/software-qualities/

  15. [15] Joanna Szych. 2026. Evaluating LLM-Generated Code: Benchmarking on complex assignment. https://github.com/AsiaSzych/Tree_of_Life/

  16. [16] Joanna Szych. 2026. Evaluating LLM-Generated Code: Developer Study. doi:10.5281/zenodo.18806359

  17. [17] Christopher Tralie. 2025. Building The Tree of Life from Scratch. http://nifty.stanford.edu/2025/tralie-phylogenetic-trees/

  18. [18] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA086 (June 2025), 23 pages. doi:10.1145/3728963

  19. [19] Wei Wang, Huilong Ning, Gaowei Zhang, Libo Liu, and Yi Wang. 2024. Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks. Proc. ACM Softw. Eng. 1, FSE, Article 32 (July 2024), 23 pages. doi:10.1145/3643758

  20. [20] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE '24). Association for Computing Machin...

  21. [21] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association...

  22. [22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 [cs.CL] https://arxiv.org/abs/2306.05685

  23. [23] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2024. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. arXiv:2303.17568 [cs.LG] https://arxiv.org/abs/2303.17568