Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
Pith reviewed 2026-05-08 02:49 UTC · model grok-4.3
The pith
LLM-based software engineering tools need new evaluation approaches because their outputs vary across runs and admit multiple correct answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the open-ended, multi-valid, and non-deterministic outputs of LLM-based AI4SE systems fundamentally challenge long-standing evaluation assumptions: a single ground truth, deterministic outputs, and objective correctness. Through the lens of SE tasks, the paper argues that reliable evaluation is necessary for trust and adoption, summarizes current practices and their limitations, highlights challenges (the absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability, the limits of automated evaluation, and fragmentation), and outlines future directions for robust methodologies.
What carries the argument
The paper treats evaluation of LLM-based AI4SE tools as a task-dependent concept; the core mechanism is the mismatch between the characteristics of LLM outputs and the traditional SE evaluation assumption of fixed, objective correctness.
Load-bearing premise
That the five identified challenges (absent stable ground truth, subjectivity, non-determinism, limits of automated evaluation, and fragmented practices) capture the main obstacles to reliable evaluation, without requiring further case-specific evidence.
What would settle it
Finding a set of SE tasks on which LLM outputs consistently converge to a single, objectively correct answer across multiple runs would challenge the paper's view of a fundamental mismatch with traditional assumptions; a sketch of such a check follows.
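A minimal sketch of that falsification check, assuming hypothetical `query_model` and `oracle_accepts` hooks; neither name comes from the paper:

```python
# Sketch: test whether LLM outputs converge to a single oracle-validated
# answer across repeated runs. `query_model` and `oracle_accepts` are
# hypothetical stand-ins for a real model API and a task-specific oracle.

def convergence_rate(tasks, query_model, oracle_accepts, runs=10):
    """Fraction of tasks whose repeated runs all yield one oracle-valid output."""
    converged = 0
    for task in tasks:
        outputs = {query_model(task, seed=s) for s in range(runs)}
        if len(outputs) == 1 and oracle_accepts(task, next(iter(outputs))):
            converged += 1
    return converged / len(tasks)
```

A rate near 1.0 across a broad task set would undercut the claimed mismatch; rates well below 1.0 would support it.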
Original abstract
Large Language Models (LLMs) are increasingly embedded in software engineering (SE) tools, powering applications such as code generation, automated code review, and bug triage. As these LLM-based AI for Software Engineering (AI4SE) systems transition from experimental prototypes to widely deployed tools, the question of what it means to evaluate their behavior reliably has become both critical and unanswered. Unlike traditional SE or machine learning systems, LLM-based tools often produce open-ended, natural language outputs, admit multiple valid answers, and exhibit non-deterministic behavior across runs. These characteristics fundamentally challenge long-standing evaluation assumptions such as the existence of a single ground truth, deterministic outputs, and objective correctness. In this paper, we examine LLM evaluation as a general, task-dependent concept through the lens of SE tasks. We discuss why reliable evaluation is essential for trust, adoption, and meaningful assessment of LLM-based tools, summarize the current state of evaluation practices, and highlight their limitations in realistic AI4SE settings. We then identify key challenges facing current approaches, including the absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability due to non-determinism, limitations of automated and model-based evaluation, and fragmentation of evaluation practices. Finally, we outline future directions aimed at advancing LLM evaluation toward more robust, scalable, and trustworthy methodologies, to stimulate discussion on principled evaluation practices that can keep pace with the growing role of LLMs in SE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that open-ended outputs, multiple valid answers, and non-determinism in LLM-based AI4SE tools fundamentally challenge traditional SE evaluation assumptions (single ground truth, deterministic outputs, objective correctness). It reviews current practices and their limitations, identifies five core challenges (unstable ground truth, subjectivity, non-determinism, automated evaluation limits, fragmentation), and outlines future directions for more robust evaluation methodologies.
Significance. If the analysis holds, the work could usefully frame discussion of evaluation standards as LLM tools move into production SE settings. A strength is the explicit future-directions section aimed at stimulating principled practices; however, the paper's conceptual nature, without empirical grounding or task-specific counterexamples, limits immediate applicability.
Major comments (1)
- [Abstract] The assertion that LLM characteristics 'fundamentally challenge' objective correctness overlooks the continued use of executable test suites and formal oracles in core tasks such as code generation, repair, and bug triage. These methods (e.g., pass@k or equivalence checking) assess functional correctness independently of output form or run-to-run variation; the paper must explain why such oracles specifically break for LLM outputs rather than generalizing from abstract LLM properties.
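For context, the pass@k metric this comment invokes is well defined and computable. A minimal sketch of the standard unbiased estimator (popularized by the HumanEval/Codex evaluation, Chen et al., 2021), where n generations are sampled per task and c of them pass the test suite:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    samples drawn from n generations (c of which pass the tests) passes.
    Equivalent to 1 - C(n-c, k) / C(n, k), computed in a stable form."""
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws without a pass
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 samples per task, 37 passing, k = 10
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88
```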
Minor comments (2)
- [Challenges] The challenges section would be strengthened by citing at least one published study or concrete example per listed challenge to demonstrate prevalence rather than relying on general properties.
- [Future Directions] Future directions remain high-level; brief sketches of candidate metrics or protocols (e.g., for handling multi-valid outputs, as sketched below) would improve actionability.
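One candidate protocol of the kind this comment requests, sketched under the assumption that each task ships several acceptable reference answers; the similarity function here is a deliberately crude placeholder, not a recommendation from the paper:

```python
# Sketch of a multi-reference scoring protocol: a generation is judged
# against its best-matching reference, so any one of several valid
# answers can earn full credit. `token_f1` is only a stand-in; BLEU,
# ROUGE, or an embedding cosine could be substituted.
from typing import Callable, Sequence

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1, used here only as a placeholder similarity."""
    ta, tb = set(a.split()), set(b.split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def multi_reference_score(
    output: str,
    references: Sequence[str],
    similarity: Callable[[str, str], float] = token_f1,
) -> float:
    """Max-over-references score: any one valid answer can earn full credit."""
    return max(similarity(output, ref) for ref in references)
```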
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which identifies an important qualification needed in our discussion of evaluation challenges. We have revised the abstract and introduction to more precisely delineate where traditional oracles remain effective and where LLM-specific properties introduce persistent difficulties.
Point-by-point responses
- Referee: [Abstract] The assertion that LLM characteristics 'fundamentally challenge' objective correctness overlooks the continued use of executable test suites and formal oracles in core tasks such as code generation, repair, and bug triage. These methods (e.g., pass@k or equivalence checking) assess functional correctness independently of output form or run-to-run variation; the paper must explain why such oracles specifically break for LLM outputs rather than generalizing from abstract LLM properties.
Authors: We agree that executable test suites and formal oracles continue to provide objective signals for functional correctness in code-generation and repair tasks, and that pass@k and equivalence checking are largely insensitive to surface form or single-run variation. Our original phrasing was overly broad. The manuscript's core argument is that these oracles do not address the full spectrum of LLM-based SE tools: many deployed applications produce open-ended natural-language artifacts (code reviews, triage rationales, refactoring explanations) for which no executable oracle exists; even in code-centric tasks, non-determinism inflates variance in pass@k estimates across independent evaluations, and multiple semantically distinct yet functionally correct outputs complicate the notion of a single ground truth. We have revised the abstract to state that the challenges are most acute for open-ended and multi-valid-answer tasks while acknowledging the continued utility of functional oracles for purely executable outputs. A short clarifying paragraph has also been added to Section 2 to explain why the same oracles become insufficient once outputs include explanatory text or when evaluation must capture dimensions beyond functional correctness.
Revision: yes
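The variance-inflation point in this response can be made concrete with a small simulation; the per-task pass probabilities below are invented purely for illustration:

```python
# Sketch: simulate how non-determinism spreads out repeated pass@1
# evaluations of the same model on the same benchmark.
import random
import statistics

random.seed(0)
# Hypothetical per-task probabilities that a single sample passes.
task_pass_prob = [random.uniform(0.2, 0.9) for _ in range(50)]

def one_evaluation() -> float:
    """One independent benchmark run: each task passes with its own probability."""
    return sum(random.random() < p for p in task_pass_prob) / len(task_pass_prob)

scores = [one_evaluation() for _ in range(100)]
print(f"mean pass@1 = {statistics.mean(scores):.3f}, "
      f"stdev across runs = {statistics.stdev(scores):.3f}")
# A nonzero stdev means two labs evaluating the same model can report
# different scores; averaging over repeated runs (or raising n in the
# pass@k estimator) shrinks this spread.
```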
Circularity Check
No significant circularity: conceptual discussion remains self-contained
Full rationale
The paper advances a discussion of LLM evaluation challenges in SE by enumerating general properties (open-ended outputs, non-determinism, multiple valid answers) and their implications for ground truth and correctness. No equations, parameter fits, or predictions appear; claims are not defined in terms of themselves or reduced to author-fitted quantities. No load-bearing self-citations or uniqueness theorems from prior author work are invoked. The analysis draws on observable LLM and SE task characteristics without circular reduction, satisfying the default expectation for non-circular papers.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLM-based SE tools produce open-ended, natural-language outputs that admit multiple valid answers and exhibit non-deterministic behavior.