pith. machine review for the scientific record.

arxiv: 2604.24621 · v1 · submitted 2026-04-27 · 💻 cs.SE

Recognition: unknown

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM evaluation · software engineering · AI4SE · evaluation challenges · non-determinism · ground truth · trustworthy AI · future directions

The pith

LLM-based software engineering tools need new evaluation approaches because their outputs can vary and allow multiple correct answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to evaluate tools that use large language models for software engineering tasks such as generating or reviewing code. These tools often produce outputs that vary across runs and admit several valid versions, which breaks evaluation methods built on the assumption of a single right answer. Current approaches struggle with this, leading to unreliable assessments. The authors identify specific problems, such as results that change across runs and fragmented practices, then suggest paths toward better, more trustworthy evaluation techniques that support wider adoption of these AI tools.

Core claim

The central claim is that the open-ended, multi-valid, and non-deterministic outputs of LLM-based AI4SE systems fundamentally challenge long-standing evaluation assumptions: the existence of a single ground truth, deterministic outputs, and objective correctness. Through the lens of SE tasks, the paper argues that reliable evaluation is necessary for trust and adoption, summarizes current practices and their limitations, highlights key challenges (absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability, limits of automated evaluation, and fragmentation), and outlines future directions for robust methodologies.

What carries the argument

Evaluation of LLM-based AI4SE tools as a task-dependent concept, with the core mechanism being the mismatch between LLM output characteristics and traditional SE evaluation assumptions of fixed correctness.

Load-bearing premise

That the five identified challenges (absent stable ground truth, subjectivity, non-determinism, limits of automated evaluation, and fragmented practices) capture the main obstacles to reliable evaluation, a premise the paper asserts without further case-specific evidence.

What would settle it

Finding a set of SE tasks where LLM outputs consistently converge, across multiple runs, on a single objectively correct answer would challenge the paper's view that there is a fundamental mismatch with traditional evaluation assumptions.
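The falsification probe above can be sketched as a small measurement. Everything here is a hypothetical stand-in, not machinery from the paper: `query_model` samples the tool, `normalize` canonicalizes surface form, and `oracle` is whatever correctness check the task affords.

```python
# Hypothetical sketch: sample a task several times and measure how often
# runs both pass the oracle and collapse to one modal output. Full
# convergence (rate 1.0) would be evidence against the paper's claimed
# mismatch; `query_model`, `normalize`, and `oracle` are stand-ins.
from collections import Counter

def convergence_rate(query_model, normalize, oracle, task, runs=10):
    """Fraction of runs that pass the oracle AND match the modal output."""
    outputs = [normalize(query_model(task)) for _ in range(runs)]
    passing = [o for o in outputs if oracle(o)]
    if not passing:
        return 0.0
    modal, count = Counter(passing).most_common(1)[0]
    return count / runs

# Toy stand-in: a "model" that always emits the same correct answer
# converges fully; one alternating between two valid forms would not.
stable = convergence_rate(lambda t: "x + 1", str.strip, lambda o: True, "inc")
print(stable)  # 1.0
```

A task where this rate stays near 1.0 across models and temperatures is exactly the kind of counterexample the section describes.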

Original abstract

Large Language Models (LLMs) are increasingly embedded in software engineering (SE) tools, powering applications such as code generation, automated code review, and bug triage. As these LLM-based AI for Software Engineering (AI4SE) systems transition from experimental prototypes to widely deployed tools, the question of what it means to evaluate their behavior reliably has become both critical and unanswered. Unlike traditional SE or machine learning systems, LLM-based tools often produce open-ended, natural language outputs, admit multiple valid answers, and exhibit non-deterministic behavior across runs. These characteristics fundamentally challenge long-standing evaluation assumptions such as the existence of a single ground truth, deterministic outputs, and objective correctness. In this paper, we examine LLM evaluation as a general, task-dependent concept through the lens of SE tasks. We discuss why reliable evaluation is essential for trust, adoption, and meaningful assessment of LLM-based tools, summarize the current state of evaluation practices, and highlight their limitations in realistic AI4SE settings. We then identify key challenges facing current approaches, including the absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability due to non-determinism, limitations of automated and model-based evaluation, and fragmentation of evaluation practices. Finally, we outline future directions aimed at advancing LLM evaluation toward more robust, scalable, and trustworthy methodologies, to stimulate discussion on principled evaluation practices that can keep pace with the growing role of LLMs in SE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that open-ended outputs, multiple valid answers, and non-determinism in LLM-based AI4SE tools fundamentally challenge traditional SE evaluation assumptions (single ground truth, deterministic outputs, objective correctness). It reviews current practices and their limitations, identifies five core challenges (unstable ground truth, subjectivity, non-determinism, automated evaluation limits, fragmentation), and outlines future directions for more robust evaluation methodologies.

Significance. If the analysis holds, the work could usefully frame discussion on evaluation standards as LLM tools move into production SE settings. A strength is the explicit future-directions section aimed at stimulating principled practices; however, the conceptual nature without empirical grounding or task-specific counterexamples limits immediate applicability.

major comments (1)
  1. [Abstract] Abstract: The assertion that LLM characteristics 'fundamentally challenge' objective correctness overlooks the continued use of executable test suites and formal oracles in core tasks such as code generation, repair, and bug triage. These methods (e.g., pass@k or equivalence checking) assess functional correctness independently of output form or run-to-run variation; the paper must explain why such oracles specifically break for LLM outputs rather than generalizing from abstract LLM properties.
minor comments (2)
  1. [Challenges] The challenges section would be strengthened by citing at least one published study or concrete example per listed challenge to demonstrate prevalence rather than relying on general properties.
  2. [Future Directions] Future directions remain high-level; adding brief sketches of candidate metrics or protocols (e.g., for handling multi-valid outputs) would improve actionability.
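For context on the referee's oracle argument: the pass@k metric it cites is conventionally computed with the unbiased estimator of Chen et al. (2021). A minimal sketch, included here for reference rather than taken from the paper under review, makes the referee's point concrete: only test outcomes enter the formula, never the surface form of the generated code.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021): given n sampled
# completions of which c pass the test suite, estimate the probability
# that at least one of k samples is functionally correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k): chance that k draws include a passing one."""
    if n - c < k:
        return 1.0  # too few failures to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3 -- matches the empirical pass rate at k=1
```

Two runs producing syntactically different but equally passing code score identically here, which is why the referee asks the authors to say where such oracles specifically break down.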

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which identifies an important qualification needed in our discussion of evaluation challenges. We have revised the abstract and introduction to more precisely delineate where traditional oracles remain effective and where LLM-specific properties introduce persistent difficulties.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that LLM characteristics 'fundamentally challenge' objective correctness overlooks the continued use of executable test suites and formal oracles in core tasks such as code generation, repair, and bug triage. These methods (e.g., pass@k or equivalence checking) assess functional correctness independently of output form or run-to-run variation; the paper must explain why such oracles specifically break for LLM outputs rather than generalizing from abstract LLM properties.

    Authors: We agree that executable test suites and formal oracles continue to provide objective signals for functional correctness in code-generation and repair tasks, and that pass@k and equivalence checking are largely insensitive to surface form or single-run variation. Our original phrasing was overly broad. The manuscript's core argument is that these oracles do not address the full spectrum of LLM-based SE tools: many deployed applications produce open-ended natural-language artifacts (code reviews, triage rationales, refactoring explanations) for which no executable oracle exists; even in code-centric tasks, non-determinism inflates variance in pass@k estimates across independent evaluations, and multiple semantically distinct yet functionally correct outputs complicate the notion of a single ground truth. We have revised the abstract to state that the challenges are most acute for open-ended and multi-valid-answer tasks while acknowledging the continued utility of functional oracles for purely executable outputs. A short clarifying paragraph has also been added to Section 2 to explain why the same oracles become insufficient once outputs include explanatory text or when evaluation must capture dimensions beyond functional correctness. revision: yes
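The rebuttal's variance claim can be illustrated with a toy simulation: re-running an identical evaluation with fresh samples yields a spread of pass@k scores even though nothing about the task changed. The pass probability and sample counts below are illustrative assumptions, not figures from the paper.

```python
# Toy illustration of non-determinism inflating pass@k variance: repeat
# the same evaluation many times with a fixed per-sample pass probability
# p and observe the spread of the resulting estimates. p, n, and k are
# arbitrary choices for the sketch.
import random
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def one_evaluation(p, n=20, k=5, rng=random):
    c = sum(rng.random() < p for _ in range(n))  # stochastic pass count
    return pass_at_k(n, c, k)

rng = random.Random(0)
estimates = [one_evaluation(0.3, rng=rng) for _ in range(200)]
spread = max(estimates) - min(estimates)
print(round(spread, 2))  # nonzero: identical setups, different scores
```

The spread shrinks as n grows, which is the usual mitigation, but it never reaches zero for a genuinely stochastic generator; that is the persistent difficulty the authors point to.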

Circularity Check

0 steps flagged

No significant circularity: conceptual discussion remains self-contained

full rationale

The paper advances a discussion of LLM evaluation challenges in SE by enumerating general properties (open-ended outputs, non-determinism, multiple valid answers) and their implications for ground truth and correctness. No equations, parameter fits, or predictions appear; claims are not defined in terms of themselves or reduced to author-fitted quantities. No load-bearing self-citations or uniqueness theorems from prior author work are invoked. The analysis draws on observable LLM and SE task characteristics without circular reduction, satisfying the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central discussion rests on domain assumptions about LLM behavior rather than new mathematical axioms or invented entities; no free parameters are introduced.

axioms (1)
  • domain assumption LLM-based SE tools produce open-ended natural language outputs that admit multiple valid answers and exhibit non-deterministic behavior
    Invoked directly in the abstract as the basis for challenging traditional evaluation assumptions.

pith-pipeline@v0.9.0 · 5569 in / 1126 out tokens · 42835 ms · 2026-05-08T02:49:01.593350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 21 canonical work pages · 3 internal anchors

  [1] Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, and Marco Aurelio Gerosa. 2024. Applying large language models to issue classification. In Proceedings of the Third ACM/IEEE International Workshop on NL-based Software Engineering. 57–60.
  [2] Berk Atıl, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, and Breck Baldwin. 2025. Non-Determinism of "Deterministic" LLM System Settings in Hosted Environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems. A...
  [3] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
  [4] Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proc. ACM Program. Lang. 7, OOPSLA1, Article 78 (April 2023), 27 pages. doi:10.1145/3586030.
  [5] Paheli Bhattacharya, Manojit Chakraborty, Kartheek NSN Palepu, Vikas Pandey, Ishan Dindorkar, Rakesh Rajpurohit, and Rishabh Gupta. 2023. Exploring large language models for code explanation. arXiv preprint arXiv:2310.16673 (2023).
  [6] Thanh-Long Bui, Hoa Khanh Dam, and Rashina Hoda. 2025. An LLM-based multi-agent framework for agile effort estimation. arXiv preprint arXiv:2509.14483 (2025).
  [7] Gül Calikli and Mohammed Alhamed. 2025. Impact of Request Formats on Effort Estimation: Are LLMs Different Than Humans? Proc. ACM Softw. Eng. 2, FSE, Article FSE051 (June 2025), 22 pages. doi:10.1145/3715771.
  [8] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45.
  [9] Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al. 2024. A survey on evaluating large language models in code generation tasks. arXiv preprint arXiv:2408.16498 (2024).
  [10] Umut Cihan, Vahid Haratian, Arda İçöz, Mert Kaan Gül, Ömercan Devran, Emircan Furkan Bayendur, Baykal Mehmet Uçar, and Eray Tüzün. 2025. Automated code review in practice. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 425–436.
  [11] Emre Doğan, Eray Tüzün, K Ayberk Tecimer, and H Altay Güvenir. 2019. Investigating the validity of ground truth in code reviewer recommendation studies. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 1–6.
  [12] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53.
  [13] Kehua Feng, Keyan Ding, Jing Yu, Yiwen Qu, Zhiwen Chen, Gang Yu, Qiang Zhang, Huajun Chen, et al. 2025. SaMer: A scenario-aware multi-dimensional evaluator for large language models. In The Thirteenth International Conference on Learning Representations.
  [14] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
  [15] Kai Huang, Jian Zhang, Xinlei Bao, Xu Wang, and Yang Liu. 2025. Comprehensive fine-tuning large language models of code for automated program repair. IEEE Transactions on Software Engineering 51, 4 (2025), 904–928.
  [16] Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. 2025. On Mitigating Code LLM Hallucinations with API Documentation. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 237–248. doi:10.1109/ICSE-SEIP66354.2025.00027.
  [17] Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, and Jiawei Han. 2024. GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models. arXiv:2402.10744 [cs.CL]. https://arxiv.org/abs/2402.10744.
  [18] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770 (2023).
  [19] Daniel Kahneman. 2012. Thinking, Fast and Slow. Penguin, London.
  [20] D. Kahneman, O. Sibony, and C.R. Sunstein. 2021. Noise: A Flaw in Human Judgment. Little, Brown. https://books.google.com.tr/books?id=g2MBEAAAQBAJ.
  [21] Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, and Dilek Hakkani-Tür. 2025. AURA: A Diagnostic Framework for Tracking User Satisfaction of Interactive Planning Agents. arXiv preprint arXiv:2505.01592 (2025).
  [22] Atish Kumar Dipongkor. 2024. An ensemble method for bug triaging using large language models. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 438–440.
  [23] Yukyung Lee, JoongHoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, and Najoung Kim. 2025. CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 15771–...
  [24] Fengjie Li, Jiajun Jiang, Jiajun Sun, and Hongyu Zhang. 2025. Hybrid automated program repair by combining large language models and program analysis. ACM Transactions on Software Engineering and Methodology 34, 7 (2025), 1–28.
  [25] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
  [26] Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, and Li Li. 2024. On the reliability and explainability of language models for program generation. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26.
  [27] Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, and Chun Zuo. 2023. LLaMA-Reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning. In 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 647–658.
  [28] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
  [29] Stephen MacNeil, Andrew Tran, Dan Mogil, Seth Bernstein, Erin Ross, and Ziheng Huang. 2022. Generating diverse code explanations using the GPT-3 large language model. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 2. 37–39.
  [30] Matéo Mahaut, Laura Aina, Paula Czarnowska, Momchil Hardalov, Thomas Müller, and Lluis Marquez. 2024. Factual Confidence of LLMs: On Reliability and Robustness of Current Estimators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailan...
  [31] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2025. An empirical study of the non-determinism of ChatGPT in code generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–28.
  [32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
  [33] Abhilasha Ravichander, Shrusti Ghela, David Wadden, and Yejin Choi. 2025. HALoGEN: Fantastic LLM hallucinations and where to find them. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1402–1425.
  [34] Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J Nay, Kshitij Gupta, and Aran Komatsuzaki. 2023. ARB: Advanced reasoning benchmark for large language models. arXiv preprint arXiv:2307.13692 (2023).
  [35] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. 2025. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. T...
  [36] Sofia Eleni Spatharioti, David Rothschild, Daniel G Goldstein, and Jake M Hofman.
  [37] Effects of LLM-based Search on Decision Making: Speed, Accuracy, and Overreliance. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25). Association for Computing Machinery, New York, NY, USA, Article 1025, 15 pages. doi:10.1145/3706598.3714082.
  [38] Eray Tüzün, Hakan Erdogmus, Maria Teresa Baldassarre, Michael Felderer, Robert Feldt, and Burak Turhan. 2022. Ground-Truth Deficiencies in Software Engineering: When Codifying the Past Can Be Counterproductive. IEEE Software 39, 3 (2022), 85–95. doi:10.1109/MS.2021.3098670.
  [39] Stefan Wagner, Marvin Muñoz Bartón, Davide Falessi, and Sebastian Baltes.
  [40] Towards Evaluation Guidelines for Empirical Studies Involving LLMs. In Proceedings of the 2025 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering (Ottawa, Ontario, Canada) (WSESE '25). IEEE Press, 24–27. doi:10.1109/WSESE66602.2025.00011.
  [41] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA086 (June 2025), 23 pages. doi:10.1145/3728963.
  [42] Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. 2024. Justice or prejudice? Quantifying biases in LLM-as-a-judge. arXiv preprint arXiv:2410.02736 (2024).
  [43] Seonghyeon Ye, Hyeonbin Hwang, Sohee Yang, Hyeongu Yun, Yireun Kim, and Minjoon Seo. 2024. Investigating the effectiveness of task-agnostic prefix prompt for instruction following. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19386–19394.
  [44] Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen. 2023. A survey on large language models for software engineering. arXiv preprint arXiv:2312.15223 (2023).
  [45] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum, and Chuang Gan. 2023. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510 (2023).
  [46] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623.