pith. machine review for the scientific record.

arxiv: 2605.09059 · v1 · submitted 2026-05-09 · 💻 cs.SE

Recognition: no theorem link

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM code generation, code evaluation, benchmarks, developer study, code quality, production readiness, software engineering, LLM evaluation

The pith

Developer reviews uncover production-readiness issues in LLM code that standard correctness benchmarks overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a three-fold evaluation method for code generated by large language models. It pairs a custom correctness benchmark built around a complex multi-level computer science project with code-quality checks and structured developer surveys. The authors apply this method to compare three models and find that the developer reviews surface issues of maintainability and real-world usability that the benchmark alone cannot detect.

Core claim

A three-fold methodology that combines a dedicated correctness benchmark on a complex project, code quality verification, and developer opinions gathered through structured code reviews provides a fuller picture of LLM-generated code than correctness-focused benchmarks. When used on GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4, the reviews produced additional findings on whether the code reaches a production-ready state.

What carries the argument

Three-fold evaluation methodology integrating a custom correctness benchmark, code quality verification, and structured developer code-review surveys.
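
As a rough illustration of the kind of per-model report such a three-fold methodology could produce, the sketch below combines the three folds into one summary. This is not the authors' implementation; the data structure, field names, and example numbers are hypothetical.

```python
# Minimal sketch (not the paper's code) of collecting the three folds per model:
# correctness on benchmark tasks, static quality findings, and developer-review themes.
from dataclasses import dataclass, field


@dataclass
class ModelEvaluation:
    model: str
    # Fold 1: correctness benchmark -- tasks whose generated solution passes its tests.
    tasks_passed: int = 0
    tasks_total: int = 0
    # Fold 2: code quality verification -- e.g. issue counts from a static analyser,
    # keyed by severity (keys here are illustrative).
    quality_issues: dict = field(default_factory=dict)
    # Fold 3: structured developer reviews -- recurring themes extracted from survey
    # responses (e.g. "error handling missing", "hard to extend").
    review_themes: list = field(default_factory=list)

    @property
    def correctness(self) -> float:
        return self.tasks_passed / self.tasks_total if self.tasks_total else 0.0


def summarize(evaluations: list[ModelEvaluation]) -> None:
    """Print a side-by-side summary of the three folds for each model."""
    for e in evaluations:
        print(f"{e.model}: correctness={e.correctness:.0%}, "
              f"quality issues={sum(e.quality_issues.values())}, "
              f"review themes={len(set(e.review_themes))}")


# Hypothetical usage with made-up numbers, only to show the shape of the report.
summarize([
    ModelEvaluation("gpt-4.1", tasks_passed=18, tasks_total=20,
                    quality_issues={"major": 2, "minor": 7},
                    review_themes=["missing input validation", "unclear naming"]),
])
```

In the paper's terms, the third field is where the production-readiness findings that the correctness benchmark cannot detect would surface.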

Load-bearing premise

That feedback collected from developers in a structured review process gives reliable, generalizable information about production readiness that benchmarks miss.

What would settle it

Repeating the developer reviews on the same code samples with new reviewers and finding that they consistently identify no additional production-readiness problems beyond the benchmark results.

read the original abstract

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model. However, they primarily focus on measuring solution correctness, leaving other aspects, such as code quality and usability, behind. This paper aims to describe a custom three-fold evaluation methodology for code generated by Large Language Models that bridges this gap. The methodology includes a dedicated correctness benchmark based on a complex multi-level computer science project, code quality verification, and a survey of developers' opinions on generated code samples gathered through a structured code-review process. The proposed methodology's usage and usefulness are demonstrated by evaluating and comparing three general-purpose Large Language Models: GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4. The results show that reviews gathered from developers can yield many new findings, especially those related to the code being in a production-ready state, that would not be possible to obtain using the standard correctness-focused benchmark approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a three-fold methodology for evaluating LLM-generated code: (1) a correctness benchmark built on a complex multi-level computer science project, (2) code quality verification, and (3) structured developer reviews collected via a survey process. The authors apply the methodology to compare three general-purpose LLMs (GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4) and conclude that the developer reviews surface production-readiness insights (e.g., maintainability, deployment concerns) that standard correctness-only benchmarks miss.

Significance. If the concrete examples of additional findings hold, the work is significant for highlighting limitations of purely automated correctness benchmarks in code generation. The new benchmark on a complex project and the explicit inclusion of human developer feedback address a recognized gap in the field, potentially informing more holistic evaluation frameworks. The demonstration-style results provide practical evidence that developer input can reveal usability and production aspects not captured by pass@k or similar metrics.
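
For context on the correctness-only metrics the referee contrasts with developer feedback, pass@k is typically computed with the unbiased estimator used for HumanEval-style benchmarks (cf. reference [3]); a minimal sketch follows, with the example numbers invented here.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: the probability that at least one
    of k samples, drawn without replacement from n generated solutions of which
    c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing solution
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Example: 3 of 10 generations pass the task's tests.
print(pass_at_k(n=10, c=3, k=1))  # ~0.30
```

A score like this says nothing about maintainability or deployment concerns, which is precisely where the developer reviews are claimed to add value.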

major comments (1)
  1. [Results] The claim that developer reviews 'yield many new findings' on production readiness rests on the presentation of specific examples; without a reported number of reviewers, inter-rater agreement, or exclusion criteria, it is difficult to assess whether the additional insights are robust or idiosyncratic to the small set of reviewed samples.
minor comments (2)
  1. [Methodology] The methodology description would be strengthened by an appendix containing the exact survey instrument and code-review template used with developers.
  2. [Results] A table or figure summarizing per-model correctness scores, quality metrics, and review themes side by side would improve the readability of the comparative results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the major comment below and will revise the manuscript to provide the requested methodological details.

read point-by-point responses
  1. Referee: [Results] The claim that developer reviews 'yield many new findings' on production readiness rests on the presentation of specific examples; without a reported number of reviewers, inter-rater agreement, or exclusion criteria, it is difficult to assess whether the additional insights are robust or idiosyncratic to the small set of reviewed samples.

    Authors: We agree that the current description of the developer review component lacks sufficient detail for readers to fully evaluate the robustness of the reported insights. In the revised manuscript, we will expand the relevant sections to report: the exact number of developers who participated in the structured code-review process, their professional backgrounds and selection criteria, any exclusion criteria applied to code samples or individual responses, and the protocol used to synthesize recurring themes from the qualitative feedback. We will also clarify that the reviews consisted of independent structured assessments rather than paired quantitative ratings, which is why inter-rater agreement metrics were not computed; instead, we will describe how common production-readiness concerns were identified across responses. These additions will make explicit that the examples are drawn from a defined process and are intended to illustrate gaps missed by correctness benchmarks, rather than to claim statistical generalizability. · revision: yes
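
If a revised protocol did collect paired categorical verdicts from multiple reviewers on the same samples, agreement could be quantified with a standard statistic such as Cohen's kappa. The sketch below is purely illustrative of that analysis and is not part of the paper's reported results; the labels, sample count, and the "ready"/"not ready" scale are invented here.

```python
# Illustrative only: Cohen's kappa for two reviewers who independently label
# the same code samples with a categorical production-readiness verdict.
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0


# Hypothetical example: two reviewers, five code samples.
print(cohens_kappa(["ready", "not ready", "ready", "ready", "not ready"],
                   ["ready", "not ready", "not ready", "ready", "not ready"]))  # ~0.62
```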

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical three-fold methodology (correctness benchmark on a complex project, code quality checks, and structured developer reviews) to evaluate LLM-generated code and demonstrate that reviews surface production-readiness insights missed by standard benchmarks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the described approach or results. The central claim is supported directly by the concrete findings from applying the methodology to GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4, without any reduction to definitional inputs or prior self-referential results. The study is self-contained as a proof-of-concept demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that developer reviews add unique production-readiness information unavailable from automated benchmarks; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Developer opinions obtained through a structured code-review process provide reliable signals about production readiness that automated correctness and quality metrics miss.
    This assumption underpins the third fold and the claim that new findings emerge from the survey.

pith-pipeline@v0.9.0 · 5476 in / 1155 out tokens · 43645 ms · 2026-05-12T02:22:34.681743+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 5 internal anchors

  1. [1] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Su...

  2. [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

  3. [3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4] DeepSeek. 2025. The Temperature Parameter | DeepSeek API Docs. https://api-docs.deepseek.com/quick_start/parameter_settings

  5. [5] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE '24). Association for Computing Machinery, New York...

  6. [6] Eitan Farchi, Shmulik Froimovich, Rami Katan, and Orna Raz. 2024. Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks. arXiv:2410.21071 [cs.SE] https://arxiv.org/abs/2410.21071

  7. [7] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. arXiv:1808.09588 [cs.CL] https://arxiv.org/abs/1808.09588

  8. [8] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

  9. [9] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv.org/abs/2305.01210

  10. [10] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv:2306.03091 [cs.CL] https://arxiv.org/abs/2306.03091

  11. [11] Bradley McDanel and Ed Novak. 2025. Designing LLM-Resistant Programming Assignments: Insights and Strategies for CS Educators. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 (Pittsburgh, PA, USA) (SIGCSE TS 2025). Association for Computing Machinery, New York, NY, USA, 756–762. doi:10.1145/3641554.3701872

  12. [12] Tanha Miah and Hong Zhu. 2024. User Centric Evaluation of Code Generation Tools (Invited Paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest). 109–119. doi:10.1109/AITest62860.2024.00022

  13. [13] SonarSource. 2025. Homepage | SonarQube Cloud | Sonar Documentation. https://docs.sonarsource.com/sonarqube-cloud

  14. [14] SonarSource. 2025. Software qualities | SonarQube Cloud Documentation. https://docs.sonarsource.com/sonarqube-cloud/digging-deeper/software-qualities/

  15. [15] Joanna Szych. 2026. Evaluating LLM-Generated Code: Benchmarking on complex assignment. https://github.com/AsiaSzych/Tree_of_Life/

  16. [16] Joanna Szych. 2026. Evaluating LLM-Generated Code: Developer Study. doi:10.5281/zenodo.18806359

  17. [17] Christopher Tralie. 2025. Building The Tree of Life from Scratch. http://nifty.stanford.edu/2025/tralie-phylogenetic-trees/

  18. [18] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA086 (June 2025), 23 pages. doi:10.1145/3728963

  19. [19] Wei Wang, Huilong Ning, Gaowei Zhang, Libo Liu, and Yi Wang. 2024. Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks. Proc. ACM Softw. Eng. 1, FSE, Article 32 (July 2024), 23 pages. doi:10.1145/3643758

  20. [20] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE '24). Association for Computing Machin...

  21. [21] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association...

  22. [22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 [cs.CL] https://arxiv.org/abs/2306.05685

  23. [23] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2024. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. arXiv:2303.17568 [cs.LG] https://arxiv.org/abs/2303.17568