pith. sign in

arxiv: 2605.25536 · v1 · pith:AZF5YB4Inew · submitted 2026-05-25 · 💻 cs.SE · cs.AI

A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions

Pith reviewed 2026-06-29 20:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords large language modelscode generationtertiary reviewsoftware engineeringevaluation challengestrendsfuture directions
0
0 comments X

The pith

LLM-based code generation tasks show promise on benchmarks yet lack robust real-world validation and face efficiency hurdles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This tertiary review examines 30 secondary studies on large language models applied to code-generating tasks in software engineering. It reveals rapid publication growth since 2023 alongside evidence that benchmark accuracy does not strongly support generalization to practical development scenarios. The work highlights fragile robustness across different tasks and configurations, pervasive efficiency constraints, and under-reporting of issues like toxicity and bias. Dominant challenges identified include economic feasibility, evaluation validity, and socio-technical integration into development workflows. The synthesis points toward the necessity of domain-aware model enhancements and more holistic, standardized evaluation approaches.

Core claim

The tertiary study consolidates secondary evidence showing that LLM-based code generating tasks represent a fast-maturing research area with strong reported accuracy on benchmarks but weakly supported real-world generalization, fragile robustness, pervasive efficiency constraints, and under-reported toxicity and bias; it identifies dominant challenges in economic feasibility, evaluation validity, and socio-technical integration, and proposes future directions focused on domain-aware model improvement alongside holistic and standardized evaluation.

What carries the argument

Synthesis of 30 secondary studies using SWEBOK knowledge areas and the HELM framework, identified through library searches and snowballing.

If this is right

  • Reported benchmark accuracy provides limited assurance for real-world software engineering applications.
  • Efficiency and associated costs must be addressed to make LLM-based code generation economically feasible.
  • Standardized and holistic evaluation methods are required to properly assess robustness and integration challenges.
  • Domain-aware improvements to models will be necessary to overcome current limitations in applicability.
  • Future research should prioritize addressing socio-technical aspects of LLM integration in development processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Without better evaluation practices, industry adoption of these models risks being based on overstated capabilities.
  • Under-reporting of bias and toxicity could lead to unintended consequences in generated code affecting security or fairness.
  • Extending this review with primary studies on actual developer productivity metrics could test the generalization claims.

Load-bearing premise

The 30 secondary studies identified through searches and snowballing represent an unbiased and complete sample of the existing literature.

What would settle it

Discovery of many additional secondary studies on LLM code generation that report strong real-world generalization or efficient performance would undermine the synthesized conclusions on uneven evaluation.

Figures

Figures reproduced from arXiv: 2605.25536 by Jim Buckley, Michael English, Muslim Chochlov.

Figure 1
Figure 1. Figure 1: PRISMA flowchart of the tertiary review process. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Secondary studies per year by study type [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Intersection of AI scope and SE scope shows secondary studies grouped by their AI scope, table 11 shows secondary studies grouped by SE scope, table 12 shows the distribution of SE KAs, and [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗
read the original abstract

Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported results are promising, the broader effects of such application and their integration into real-world development remain insufficiently understood with existing tertiary studies provide little in this area. Objective. This tertiary study consolidates secondary evidence on LLM-based CGTs, synthesizing the publication landscape, effects, scenarios, integration challenges, and future research directions. Method. Following systematic review guidelines, we searched in related digital libraries, complemented by backward-and-forward snowballing and screening step. Study quality was assessed and extraction reliability was audited with inter-rater agreement statistics. Evidence was synthesized using SWEBOK knowledge areas and the HELM framework. Results. We identify 30 secondary studies published between 2017-2025, with rapid growth since 2023. Accuracy seems strong on benchmarks but weakly supported for real-world generalization; robustness is fragile across tasks and configurations; efficiency constraints are pervasive; toxicity and bias are under-reported. Dominant challenges concern economic feasibility, evaluation validity, and socio-technical integration. Future directions suggest domain-aware model improvement and the need for holistic, standardized evaluation. Conclusion. LLM-based CGTs represent a fast-maturing yet unevenly evaluated research area, highlighting the need for domain-aware model improvements and holistic, standardized evaluation, addressing efficiency and associated costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This tertiary review identifies 30 secondary studies (2017-2025) on LLM-based code-generating tasks (CGTs) via library searches and snowballing, assesses quality and inter-rater agreement, and synthesizes trends, effects (accuracy, robustness, efficiency, toxicity), scenarios, integration challenges, and future directions using SWEBOK knowledge areas and the HELM framework, concluding that the area is fast-maturing yet unevenly evaluated and requires domain-aware model improvements plus holistic, standardized evaluation addressing efficiency and costs.

Significance. If the 30-study sample is representative and the synthesis reliable, the work consolidates evidence on a rapidly expanding topic, highlighting gaps in real-world generalization, robustness across configurations, and evaluation validity that could usefully inform research priorities in software engineering.

major comments (2)
  1. [Method] Method section: the abstract and description claim adherence to systematic review guidelines, quality assessment, and inter-rater agreement auditing, yet supply no search strings, exact inclusion/exclusion criteria, or handling of data exclusions; this directly affects the load-bearing claim that the 30 studies constitute a representative sample.
  2. [Results] Results / Discussion: the representativeness assumption for the 30 secondary studies (identified via libraries and snowballing) is asserted without reported coverage metrics, bias checks, or comparison to the broader population of secondary studies on LLM-based CGTs, weakening the synthesis of trends and challenges.
minor comments (2)
  1. [Abstract] Abstract: the phrasing 'existing tertiary studies provide little in this area' is vague; a brief citation or count of prior tertiary reviews would clarify the novelty claim.
  2. [Results] The HELM framework application and SWEBOK mapping are described at a high level; an explicit table or appendix showing how individual findings map to framework dimensions would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving methodological transparency. We address each point below and will make revisions to the manuscript.

read point-by-point responses
  1. Referee: [Method] Method section: the abstract and description claim adherence to systematic review guidelines, quality assessment, and inter-rater agreement auditing, yet supply no search strings, exact inclusion/exclusion criteria, or handling of data exclusions; this directly affects the load-bearing claim that the 30 studies constitute a representative sample.

    Authors: We agree that the current manuscript does not include the specific search strings, detailed inclusion/exclusion criteria, or explicit handling of data exclusions. This is a presentation gap that reduces transparency. In the revised version, we will expand the Method section (and add an appendix if needed) with the exact search strings used across libraries, the complete inclusion and exclusion criteria, and how exclusions were managed during screening and snowballing. These details were part of our protocol and will directly support the representativeness claim. revision: yes

  2. Referee: [Results] Results / Discussion: the representativeness assumption for the 30 secondary studies (identified via libraries and snowballing) is asserted without reported coverage metrics, bias checks, or comparison to the broader population of secondary studies on LLM-based CGTs, weakening the synthesis of trends and challenges.

    Authors: The observation is correct; we did not report quantitative coverage metrics, formal bias checks, or direct comparisons to the full population of secondary studies. In revision, we will add a dedicated paragraph in the Results or Limitations subsection discussing the sampling approach (multi-library search plus snowballing), its potential biases, and qualitative indicators of coverage. We will also note any challenges in obtaining a definitive population count for comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This tertiary review paper follows a standard systematic literature review methodology (library searches, snowballing, quality assessment, inter-rater agreement) to synthesize 30 secondary studies using SWEBOK and HELM frameworks. There are no mathematical derivations, predictions, fitted parameters, equations, or load-bearing self-citations that reduce claims to inputs by construction. The central claims about trends, challenges, and future directions follow directly from the evidence base without self-referential loops or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The review depends on the validity of standard systematic literature review procedures and the representativeness of the selected secondary studies; no free parameters, new entities, or ad-hoc axioms are introduced.

axioms (1)
  • domain assumption Systematic review guidelines and inter-rater agreement statistics ensure reliable evidence synthesis
    Invoked in the Method section of the abstract as the basis for study selection and quality assessment.

pith-pipeline@v0.9.1-grok · 5779 in / 1101 out tokens · 29834 ms · 2026-06-29T20:46:10.246339+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

114 extracted references · 21 canonical work pages · 10 internal anchors

  1. [1]

    d.].ACM Digital Library

    Association for Computing Machinery (ACM) [n. d.].ACM Digital Library. Association for Computing Machinery (ACM). Accessed: 2025-09-12

  2. [2]

    d.].dblp: Computer Science Bibliography

    Schloss Dagstuhl – Leibniz-Zentrum für Informatik [n. d.].dblp: Computer Science Bibliography. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. Accessed: 2025-09-12

  3. [3]

    d.].Google Scholar

    Google [n. d.].Google Scholar. Google. Accessed: 2025-09-12

  4. [4]

    d.].IEEE Xplore Digital Library

    Institute of Electrical and Electronics Engineers (IEEE) [n. d.].IEEE Xplore Digital Library. Institute of Electrical and Electronics Engineers (IEEE). Accessed: 2025-09-12

  5. [5]

    d.].OpenAI GPT-4

    OpenAI [n. d.].OpenAI GPT-4. OpenAI. Accessed: 2025-09-12

  6. [6]

    d.].ScienceDirect

    Elsevier [n. d.].ScienceDirect. Elsevier. Accessed: 2025-09-12

  7. [7]

    d.].Scopus

    Elsevier [n. d.].Scopus. Elsevier. Accessed: 2025-09-12

  8. [8]

    d.].Web of Science Core Collection

    Clarivate [n. d.].Web of Science Core Collection. Clarivate. Accessed: 2025-09-12

  9. [9]

    Areeg Ahmed, Shahira Azab, and Yasser Abdelhamid. 2023. Source-code generation using deep learning: a survey. InEPIA Conference on Artificial Intelligence. Springer, 467–482

  10. [10]

    Gul Aftab Ahmed, James Vincent Patten, Yuanhua Han, Guoxian Lu, David Gregg, Jim Buckley, and Muslim Chochlov. 2023. Using Ensemble Inference to Improve Recall of Clone Detection. In2023 IEEE 17th International Workshop on Software Clones (IWSC). IEEE, 15–21

  11. [11]

    Domenico Amalfitano, Stefano Faralli, Jean Carlo Rossa Hauck, Santiago Matalonga, and Damiano Distante. 2023. Artificial intelligence applied to software testing: A tertiary study.Comput. Surveys56, 3 (2023), 1–38

  12. [12]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732(2021)

  13. [13]

    Abdulrahman Ahmed Bobakr Baqais and Mohammad Alshayeb. 2020. Automatic software refactoring: a systematic literature review.Software Quality Journal28, 2 (2020), 459–502

  14. [14]

    Jessica Bates, Paul Best, Janice McQuilkin, and Brian Taylor. 2017. Will web search engines replace bibliographic databases in the systematic identification of research?The Journal of Academic Librarianship43, 1 (2017), 8–17

  15. [15]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

  16. [16]

    Mohamad Adam Bujang and N Baharum. 2017. Guidelines of the minimum sample size requirements for Kappa agreement test. Epidemiol. Biostat. Public Health14, 2 (2017)

  17. [17]

    M. C. and Coauthors. 2025. Supplementary Material (private Zenodo link under review). https://zenodo.org/records/17582721?preview=1&token= eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImZmZmFkNjQ2LTc1ZWQtNGRiYi04ZGVmLWQwOTAwMDU2YzRkNCIsImRhdGEiOnt9LCJyYW5kb20iOiJiMWFhN2Y2MzgwNTQ2YmFiZDRkMGMyM2JkOGViYTQ1NSJ9. lm0Y2dGigtT7lxiQMjqjvFHn5-zDqC7ep2BB3b9g8sHIjwsD15NjPvPTfjSGP_j...

  18. [18]

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In30th USENIX security symposium (USENIX Security 21). 2633–2650

  19. [19]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  20. [20]

    Muslim Chochlov, Gul Aftab Ahmed, James Vincent Patten, Guoxian Lu, Wei Hou, David Gregg, and Jim Buckley. 2022. Using a nearest-neighbour, BERT-based approach for scalable clone detection. In2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 582–591. Manuscript submitted to ACM 40 Trovato et al

  21. [21]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways.Journal of Machine Learning Research24, 240 (2023), 1–113

  22. [22]

    CORE. 2025. CORE Conference Portal. https://www.core.edu.au/conference-portal. Accessed: 2025-09-22

  23. [23]

    Zheyuan Kevin Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. 2025. The effects of generative AI on high-skilled work: Evidence from three field experiments with software developers.A vailable at SSRN 4945566(2025)

  24. [24]

    Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt, and Wannes Meert. 2022. Code generation using machine learning: A systematic review.Ieee Access10 (2022), 82434–82455

  25. [25]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

  26. [26]

    Thomas G Dietterich. 2000. Ensemble methods in machine learning. InInternational workshop on multiple classifier systems. Springer, 1–15

  27. [27]

    Liming Dong, Qinghua Lu, and Liming Zhu. 2024. A pilot study in surveying data challenges of automatic software engineering tasks. InProceedings of the 4th International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things. 6–11

  28. [28]

    Alvan R Feinstein and Domenic V Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes.Journal of clinical epidemiology 43, 6 (1990), 543–549

  29. [29]

    Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters.Psychological bulletin76, 5 (1971), 378

  30. [30]

    2023.Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Product quality model

    International Organization for Standardization. 2023.Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Product quality model. ISO

  31. [31]

    Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419

  32. [32]

    Gordon Fraser and Andrea Arcuri. 2012. Whole test suite generation.IEEE Transactions on Software Engineering39, 2 (2012), 276–291

  33. [33]

    Simon Cornelius Gorissen, Stefan Sauer, and Wolf G Beckmann. 2024. A survey of natural language-based editing of low-code applications using large language models. InInternational Conference on Human-Centred Software Engineering. Springer, 243–254

  34. [34]

    Muhammet Kürşat Görmez, Murat Yılmaz, and Paul M Clarke. 2024. Large language models for software engineering: A systematic mapping study. InEuropean Conference on Software Process Improvement. Springer, 64–79

  35. [35]

    Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis.Foundations and Trends®in Programming Languages4, 1-2 (2017), 1–119

  36. [36]

    Eddie Guo, Mehul Gupta, Jiawen Deng, Ye-Jean Park, Michael Paget, and Christopher Naugler. 2024. Automated paper screening for clinical reviews using large language models: data analysis study.Journal of Medical Internet Research26 (2024), e48996

  37. [37]

    Michael Gusenbauer and Neal R Haddaway. 2020. Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources.Research synthesis methods11, 2 (2020), 181–217

  38. [38]

    Kilem Li Gwet. 2008. Computing inter-rater reliability and its variance in the presence of high agreement.Brit. J. Math. Statist. Psych.61, 1 (2008), 29–48

  39. [39]

    2014.Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters

    Kilem L Gwet. 2014.Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC

  40. [40]

    Neal Robert Haddaway, Alexandra Mary Collins, Deborah Coughlin, and Stuart Kirk. 2015. The role of Google Scholar in evidence reviews and its applicability to grey literature searching.PloS one10, 9 (2015), e0138237

  41. [41]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79

  42. [42]

    Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, and David Lo. 2025. Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks.arXiv preprint arXiv:2505.08903(2025)

  43. [43]

    Kai Huang, Zhengzi Xu, Su Yang, Hongyu Sun, Xuejun Li, Zheng Yan, and Yuqing Zhang. 2023. A survey on automated program repair techniques. arXiv preprint arXiv:2303.18184(2023)

  44. [44]

    Zhang Huangzhao, Zhang Kechi, Li Zhuo, Li Jia, Li Yongmin, Zhao Yunfei, Zhu Yuqi, Liu Fang, Li Ge, and Jin Zhi. 2024. Deep learning for code generation: A survey.SCIENCE CHINA Information Sciences ISSN(2024)

  45. [45]

    Aleksi Huotala, Miikka Kuutila, Paul Ralph, and Mika Mäntylä. 2024. The promise and challenges of using LLMs to accelerate the screening process of systematic reviews. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 262–271

  46. [46]

    Rasha Ahmad Husein, Hala Aburajouh, and Cagatay Catal. 2025. Large language models for code completion: A systematic literature review. Computer Standards & Interfaces92 (2025), 103917

  47. [47]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation.arXiv preprint arXiv:2406.00515(2024)

  48. [48]

    Sathvik Joel, Jie JW Wu, and Fatemeh H Fard. 2024. A survey on llm-based code generation for low-resource and domain-specific programming languages.arXiv preprint arXiv:2410.03981(2024)

  49. [49]

    Barbara Kitchenham and Pearl Brereton. 2013. A systematic review of systematic review process research in software engineering.Information and software technology55, 12 (2013), 2049–2075. Manuscript submitted to ACM A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions41

  50. [50]

    Barbara Kitchenham, Stuart Charters, et al. 2007. Guidelines for performing systematic literature reviews in software engineering. (2007)

  51. [51]

    Barbara Kitchenham, Lech Madeyski, and David Budgen. 2022. SEGRESS: Software engineering guidelines for reporting secondary studies.IEEE Transactions on Software Engineering49, 3 (2022), 1273–1298

  52. [52]

    Barbara Kitchenham, Rialette Pretorius, David Budgen, O Pearl Brereton, Mark Turner, Mahmood Niazi, and Stephen Linkman. 2010. Systematic literature reviews in software engineering–a tertiary study.Information and software technology52, 8 (2010), 792–805

  53. [53]

    Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. InIjcai, Vol. 14. Montreal, Canada, 1137–1145

  54. [54]

    Zoe Kotti, Rafaila Galanopoulou, and Diomidis Spinellis. 2023. Machine learning for software engineering: A tertiary study.Comput. Surveys55, 12 (2023), 1–39

  55. [55]

    2018.Content analysis: An introduction to its methodology

    Klaus Krippendorff. 2018.Content analysis: An introduction to its methodology. Sage publications

  56. [56]

    Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511(2020)

  57. [57]

    J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data.biometrics(1977), 159–174

  58. [58]

    Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, and Jaechang Nam. 2025. Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges.arXiv preprint arXiv:2504.20799(2025)

  59. [59]

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer

  60. [60]

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.arXiv preprint arXiv:1910.13461(2019)

  61. [61]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode.Science378, 6624 (2022), 1092–1097

  62. [62]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models.arXiv preprint arXiv:2211.09110(2022)

  63. [63]

    Jialiang Lin, Yao Yu, Yu Zhou, Zhiyang Zhou, and Xiaodong Shi. 2020. How many preprints have actually been printed and why: a case study of computer science preprints on arXiv.Scientometrics124, 1 (2020), 555–574

  64. [64]

    Hsiao-Chuan Liu, Chia-Tung Tsai, and Min-Yuh Day. 2024. A Pilot Study on AI-Assisted Code Generation with Large Language Models for Software Engineering. InInternational Conference on Technologies and Applications of Artificial Intelligence. Springer, 162–175

  65. [65]

    Kui Liu, Li Li, Anil Koyuncu, Dongsun Kim, Zhe Liu, Jacques Klein, and Tegawendé F Bissyandé. 2021. A critical review on the evaluation of automated program repair systems.Journal of Systems and Software171 (2021), 110817

  66. [66]

    Mary L McHugh. 2012. Interrater reliability: the kappa statistic.Biochemia medica22, 3 (2012), 276–282

  67. [67]

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey.arXiv preprint arXiv:2402.06196(2024)

  68. [68]

    David Moher, Larissa Shamseer, Mike Clarke, Davina Ghersi, Alessandro Liberati, Mark Petticrew, Paul Shekelle, Lesley A Stewart, and Prisma-P Group. 2015. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement.Systematic reviews4, 1 (2015), 1

  69. [69]

    automatic patch generation learned from human-written patches

    Martin Monperrus. 2014. A critical review of" automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair. InProceedings of the 36th International Conference on Software Engineering. 234–242

  70. [70]

    Robert G Newcombe. 1998. Two-sided confidence intervals for the single proportion: Comparison of seven methods.Statistics in Medicine17, 8 (1998), 857–872

  71. [71]

    Björn Nykvist, Biljana Macura, Maria Xylia, and Erik Olsson. 2025. Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis.Environmental Evidence14, 1 (2025), 7

  72. [72]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

  73. [73]

    Alison O’Mara-Eves, James Thomas, John McNaught, Makoto Miwa, and Sophia Ananiadou. 2015. Using text mining for study identification in systematic reviews: a systematic review of current approaches.Systematic reviews4, 1 (2015), 5

  74. [74]

    Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval augmented code generation and summarization.arXiv preprint arXiv:2108.11601(2021)

  75. [75]

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4.arXiv preprint arXiv:2304.03277 (2023)

  76. [76]

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of ai on developer productivity: Evidence from github copilot. arXiv preprint arXiv:2302.06590(2023)

  77. [77]

    Jorge Pérez, Jessica Díaz, Javier Garcia-Martin, and Bernardo Tabuenca. 2020. Systematic literature reviews in software engineering—enhancement of the study selection process using Cohen’s Kappa statistic.Journal of Systems and Software168 (2020), 110657

  78. [78]

    Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for conducting systematic mapping studies in software engineering: An update.Information and software technology64 (2015), 1–18. Manuscript submitted to ACM 42 Trovato et al

  79. [79]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research21, 140 (2020), 1–67

  80. [80]

    Leonardo Criollo Ramírez, Xavier Limón, Ángel J Sánchez-García, and Juan Carlos Pérez-Arriaga. 2024. State of the Art of the Security of Code Generated by LLMs: A Systematic Literature Review. In2024 12th International Conference in Software Engineering Research and Innovation (CONISOFT). IEEE, 331–339

Showing first 80 references.