pith. machine review for the scientific record.

arxiv: 2603.00989 · v3 · submitted 2026-03-01 · 💻 cs.SE

Recognition: no theorem link

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:46 UTC · model grok-4.3

classification 💻 cs.SE
keywords sustainable code generation · large language models · systematic literature review · energy efficiency · software sustainability · LLM code · environmental impact · code efficiency

The pith

Research on sustainable LLM-generated code is limited and lacks any standard measurement framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper performs a systematic literature review of studies on code produced by large language models, with a focus on sustainability. It examines how sustainability is defined, which metrics (such as energy use and resource consumption) are applied, and whether methods like fine-tuning or prompt engineering improve outcomes. The review concludes that the body of work is small and scattered: no common framework or set of benchmarks has emerged for evaluating the environmental impact of the generated code. Without such standards, the growing use of LLMs in software development risks producing inefficient code that raises long-term energy demands during execution.
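One concrete instance of the resource-consumption metrics the review surveys: execution time and peak memory of generated code can be profiled with Python's standard library. A minimal sketch — the two implementations below are illustrative stand-ins for LLM outputs, not examples drawn from the paper:

```python
import time
import tracemalloc

def profile(fn, *args):
    """Return (seconds, peak_bytes) for one call — a rough proxy for the
    runtime and memory metrics the surveyed studies report."""
    tracemalloc.start()
    t0 = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Two hypothetical model outputs for the same task: sorted unique values.
def naive(xs):
    out = []
    for x in xs:
        if x not in out:       # O(n^2) membership scan on a growing list
            out.append(x)
    return sorted(out)

def efficient(xs):
    return sorted(set(xs))     # O(n log n)

data = list(range(2000)) * 2
t_naive, m_naive = profile(naive, data)
t_eff, m_eff = profile(efficient, data)
print(f"naive:     {t_naive:.4f}s, peak {m_naive} B")
print(f"efficient: {t_eff:.4f}s, peak {m_eff} B")
```

Both variants are functionally correct, which is exactly why the review argues that correctness-only benchmarks miss the sustainability dimension.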

Core claim

The systematic literature review of primary studies shows that research on the sustainability of LLM-generated code remains relatively limited and fragmented. No widely accepted framework exists for defining sustainability, selecting metrics for energy efficiency and resource usage, or benchmarking results. The analysis examines methodological approaches, evaluation practices, experimental settings, and the effects of techniques such as fine-tuning and prompt engineering, but finds no consensus or standardized practices across the studies.

What carries the argument

Systematic literature review that selects and categorizes primary studies according to their approaches to measuring sustainability in LLM-generated code.

If this is right

  • Clearer definitions of what counts as sustainable code in the LLM context are needed.
  • Standardized evaluation methods and metrics must be developed to allow consistent assessment.
  • Further systematic research is required to advance environmentally friendly AI-assisted software engineering.
  • The potential influence of fine-tuning and prompt engineering on code sustainability remains unclear and needs targeted investigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers and organizations using LLMs for code tasks will have no reliable way to compare or improve the energy profile of their outputs until benchmarks appear.
  • Widespread adoption of current LLM tools could raise the overall energy footprint of software applications if efficiency is not measured.
  • Future LLM systems might incorporate direct sustainability scoring during generation as a practical response to this gap.
  • This review points toward the value of linking LLM code work with established practices in green software engineering.

Load-bearing premise

The primary studies selected through the systematic search comprehensively and representatively capture the current state of research on LLM-generated sustainable code without significant gaps in coverage.

What would settle it

Discovery of a substantial body of additional studies that all apply a consistent set of metrics and share a benchmarking framework for sustainability would contradict the claim of fragmentation and missing standards.

Figures

Figures reproduced from arXiv: 2603.00989 by Aroosa Hameed, Gautam Srivastava, Oussema Kirmani, Sabiya Banu Masthan Ali, Syed Muhammad Danish.

Figure 1: Primary Study Selection. Studies passed a predefined set of inclusion and exclusion criteria, summarized in Table II. Included studies used or evaluated LLMs for code generation, addressed sustainability aspects such as energy efficiency or resource usage, and reported empirical results. Studies were excluded if they did not involve LLMs, lacked sustainability analysis, focused on unrelated tasks, or did not have ac…
original abstract

Large Language Models (LLMs) are widely used in software engineering to generate, complete, translate, and fix code, improving developer productivity. While most research focuses on the energy consumption and carbon emissions of model training and inference, far less attention has been given to the sustainability of the code these models produce. The efficiency of generated code affects the long-term environmental impact of software systems. Inefficient code can increase CPU usage, memory consumption, execution time, and overall energy use during deployment and operation. As LLM-generated code becomes more common in real-world projects, even small inefficiencies can lead to high environmental costs over time. This paper examines existing research on the sustainability of code generated by LLMs. We conduct a systematic literature review to analyze selected primary studies and investigate the extent to which LLMs are capable of producing sustainable code. In addition, we examine how sustainability is defined and measured in this context, including the metrics and evaluation strategies used to assess energy efficiency and resource usage. We also explore whether techniques such as fine-tuning and prompt engineering influence the sustainability of generated code. Through a structured analysis of the selected studies, we categorize research efforts based on their methodological approaches, evaluation practices, and experimental settings. The findings indicate that research in this area remains relatively limited and fragmented, with no widely accepted framework for measuring or benchmarking the sustainability of LLM-generated code. These observations highlight the need for clearer definitions, standardized evaluation methods, and systematic research to support environmentally friendly AI-assisted software engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper conducts a systematic literature review on the sustainability of code generated by large language models (LLMs). It examines the extent to which LLMs produce sustainable code, how sustainability is defined and measured (via metrics for energy efficiency, resource usage, CPU/memory consumption, and execution time), the impact of techniques such as fine-tuning and prompt engineering, and categorizes primary studies by methodological approaches, evaluation practices, and experimental settings. The central finding is that research remains limited and fragmented, with no widely accepted framework for measuring or benchmarking the sustainability of LLM-generated code.

Significance. If the synthesis holds, the review would usefully map an emerging intersection of LLM-based code generation and software sustainability, underscoring the environmental stakes of inefficient generated code in deployed systems and calling for standardized metrics and evaluation protocols in AI-assisted software engineering.

major comments (2)
  1. [Methodology] Methodology section: The description of the search strategy, selected databases, search strings, inclusion/exclusion criteria, number of primary studies screened and included, and any quality assessment procedure is absent or insufficiently detailed. This information is load-bearing for the claim that research is 'relatively limited and fragmented,' because an incomplete or non-representative sample could artifactually produce that assessment.
  2. [Results/Discussion] Results and Discussion sections: The conclusion of 'no widely accepted framework' depends on the selected studies being comprehensive. The search terms centered on 'LLM', 'sustainable code', and 'energy efficiency' risk missing relevant work using variant terminology (e.g., 'green code generation', 'carbon-aware code synthesis', or studies focused on specific resource metrics without adopting the review's sustainability framing), which directly affects the representativeness of the evidence base.
minor comments (2)
  1. [Abstract] Abstract: Explicitly stating the final number of included primary studies would immediately convey the scope of the review to readers.
  2. [Methodology] The paper would benefit from a PRISMA-style flow diagram or table summarizing the study selection process to improve transparency.
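The PRISMA-style flow the referee asks for is simple bookkeeping: each screening stage removes records from a running total. A sketch with invented counts — the paper's actual numbers are not in the reviewed text; these only illustrate the shape of the suggested table:

```python
# PRISMA-style selection tally. Counts are hypothetical placeholders,
# not figures reported by the paper under review.
stages = [
    ("records identified",             412, 0),
    ("after duplicate removal",        None, 37),
    ("after title/abstract screening", None, 298),
    ("after full-text eligibility",    None, 54),
]

remaining = 0
for name, start, removed in stages:
    remaining = start if start is not None else remaining - removed
    print(f"{name:<32} {remaining:>4}")
```

Reporting the count at every stage is what makes the final "relatively limited" corpus size auditable rather than asserted.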

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our systematic literature review. The feedback highlights important areas for improving transparency and comprehensiveness, which we will address through revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Methodology] Methodology section: The description of the search strategy, selected databases, search strings, inclusion/exclusion criteria, number of primary studies screened and included, and any quality assessment procedure is absent or insufficiently detailed. This information is load-bearing for the claim that research is 'relatively limited and fragmented,' because an incomplete or non-representative sample could artifactually produce that assessment.

    Authors: We agree that the Methodology section requires substantially more detail to support reproducibility and the validity of our conclusions. In the revised manuscript, we will expand this section to fully describe the search strategy, including the specific databases (ACM Digital Library, IEEE Xplore, Scopus, Web of Science, and arXiv), the complete search strings with Boolean operators, the inclusion/exclusion criteria, a PRISMA flow diagram with exact screening and inclusion counts, and any quality assessment procedures applied. This will provide a stronger basis for assessing the research as limited and fragmented. revision: yes

  2. Referee: [Results/Discussion] Results and Discussion sections: The conclusion of 'no widely accepted framework' depends on the selected studies being comprehensive. The search terms centered on 'LLM', 'sustainable code', and 'energy efficiency' risk missing relevant work using variant terminology (e.g., 'green code generation', 'carbon-aware code synthesis', or studies focused on specific resource metrics without adopting the review's sustainability framing), which directly affects the representativeness of the evidence base.

    Authors: We acknowledge the potential for terminology variation to affect coverage in this emerging area. Our original search incorporated core terms along with some synonyms (such as 'green software' and 'energy-aware generation'), but we agree it may have missed certain framings. In revision, we will conduct an expanded search using the suggested variant terms, update the results to reflect any additional studies found, and add an explicit limitations discussion on terminology challenges and their implications for completeness. This will either reinforce or refine our conclusion regarding the absence of a widely accepted framework. revision: yes
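The "complete search strings with Boolean operators" the rebuttal promises usually take one shape: OR-groups of synonyms, ANDed together. A hedged sketch of that construction — the term lists are illustrative, including the variant framings the referee flags, and are not the paper's actual search strings:

```python
def build_query(*synonym_groups):
    """AND together OR-groups of synonyms: the standard SLR search-string shape."""
    clauses = []
    for group in synonym_groups:
        clauses.append("(" + " OR ".join(f'"{t}"' for t in group) + ")")
    return " AND ".join(clauses)

query = build_query(
    ["large language model", "LLM"],
    ["code generation", "program synthesis"],
    # variant sustainability framings the referee notes are easy to miss:
    ["sustainable", "green", "energy-efficient", "carbon-aware"],
)
print(query)
```

Widening only the third group is how the authors could run the expanded search without disturbing the rest of the protocol.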

Circularity Check

0 steps flagged

No circularity: SLR reports external primary-study observations

full rationale

This is a systematic literature review whose central claim (research remains limited and fragmented with no accepted framework) is obtained by counting and categorizing the primary studies returned by the described search string across databases. No equations, fitted parameters, predictions, or self-citations appear in the provided text; the methodology is a standard SLR protocol whose output is the set of external papers themselves. The claim therefore does not reduce to any input by construction and rests on the representativeness of the retrieved corpus rather than on any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the application of standard systematic review methods to selected studies; no free parameters are fitted, no new entities are postulated, and the review assumes established SLR protocols without ad-hoc inventions.

axioms (1)
  • domain assumption Standard systematic literature review methodology in software engineering is sufficient to identify and synthesize relevant primary studies on LLM code sustainability.
    Invoked implicitly by conducting the SLR and drawing conclusions about the state of the field without specifying protocol details such as PRISMA adherence or exact search terms.

pith-pipeline@v0.9.0 · 5581 in / 1296 out tokens · 50761 ms · 2026-05-15T18:46:49.233282+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

141 extracted references · 141 canonical work pages · 10 internal anchors

  1. [1]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

  2. [2]

    Block- fest: Blockchain-enhanced federated sparse transformers for privacy- preserving res forecasting in internet of vehicles systems,

    A. Hameed, S. M. Danish, A. Ranjha, and G. Srivastava, “Block- fest: Blockchain-enhanced federated sparse transformers for privacy- preserving res forecasting in internet of vehicles systems,”IEEE Internet of Things Journal, 2025

  3. [3]

    A Survey of Large Language Models

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong, Y . Du, C. Yang, Y . Chen, Z. Chen, J. Jiang, R. Ren, Y . Li, X. Tang, Z. Liu, P. Liu, J.-Y . Nie, and J.-R. Wen, “A survey of large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2303.18223

  4. [4]

    GitHub Copilot documentation,

    GitHub, “GitHub Copilot documentation,” https://docs.github.com/en/ copilot, accessed: 2026-02-21

  5. [5]

    zai-org, “CodeGeeX: An open multilingual code generation model (kdd

  6. [6]

    [source code],” https://github.com/zai-org/CodeGeeX, accessed: 2026-02-21

  7. [7]

    Amazon CodeWhisperer Documenta- tion,

    Amazon Web Services, Inc., “Amazon CodeWhisperer Documenta- tion,” https://docs.aws.amazon.com/codewhisperer/, accessed: 2026-02- 21

  8. [8]

    Llms meet federated learning for scalable and secure iot management,

    Y . Otoum, A. Asad, and A. Nayak, “Llms meet federated learning for scalable and secure iot management,”arXiv preprint arXiv:2504.16032, 2025

  9. [9]

    Aligning pre-trained llms for enhanced uav power consumption forecasting,

    A. Hameed, S. M. Danish, and A. Leivadeas, “Aligning pre-trained llms for enhanced uav power consumption forecasting,” inProceedings of the IEEE Global Communications Conference (GLOBECOM), IEEE. Taipei, Taiwan: IEEE, 2025

  10. [10]

    Herrington,Code generation in action

    J. Herrington,Code generation in action. Manning Publications Co., 2003

  11. [11]

    Refactor- coderqa: Benchmarking llms for multi-domain coding question solu- tions in cloud and edge deployment,

    S. Rahman, A. Hameed, G. Srivastava, and S. M. Danish, “Refactor- coderqa: Benchmarking llms for multi-domain coding question solu- tions in cloud and edge deployment,”arXiv preprint arXiv:2509.10436, 2025

  12. [12]

    Trends in ai inference energy consumption: Beyond the performance-vs- parameter laws of deep learning,

    R. Desislavov, F. Mart ´ınez-Plumed, and J. Hern ´andez-Orallo, “Trends in ai inference energy consumption: Beyond the performance-vs- parameter laws of deep learning,”Sustainable Computing: Informatics and Systems, vol. 38, p. 100857, 2023

  13. [13]

    Holistically evaluating the environmental impact of creating language models. arxiv 2025,

    J. Morrison, C. Na, J. Fernandez, T. Dettmers, E. Strubell, and J. Dodge, “Holistically evaluating the environmental impact of creating language models. arxiv 2025,”arXiv preprint arXiv:2503.05804, 2025

  14. [14]

    Learn to code sustainably: An empirical study on llm-based green code generation,

    T. Vartziotis, I. Dellatolas, G. Dasoulas, M. Schmidt, F. Schneider, T. Hoffmann, S. Kotsopoulos, and M. Keckeisen, “Learn to code sustainably: An empirical study on llm-based green code generation,” arXiv preprint arXiv:2403.03344, 2024

  15. [15]

    Assessing the impact of refactoring energy-inefficient code patterns on software sustainability: An industry case study,

    R. Mehra, P. Pathania, V . S. Sharma, V . Kaulgud, S. Podder, and A. P. Burden, “Assessing the impact of refactoring energy-inefficient code patterns on software sustainability: An industry case study,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 1825–1827

  16. [16]

    The real climate and transformative impact of ict: A critique of estimates, trends, and regulations,

    C. Freitag, M. Berners-Lee, K. Widdicks, B. Knowles, G. S. Blair, and A. Friday, “The real climate and transformative impact of ict: A critique of estimates, trends, and regulations,”Patterns, vol. 2, no. 9, 2021

  17. [17]

    Who is using ai to code? global diffusion and impact of generative ai,

    S. Daniotti, J. Wachs, X. Feng, and F. Neffke, “Who is using ai to code? global diffusion and impact of generative ai,”arXiv preprint arXiv:2506.08945, 2025

  18. [18]

    A comprehensive overview of large language models,

    H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,”ACM Transactions on Intelligent Systems and Technology, 2023

  19. [19]

    Phishing detection in the gen-ai era: Quantized llms vs classical models,

    J. Thapa, G. Chahal, S ¸. V . Gabreanu, and Y . Otoum, “Phishing detection in the gen-ai era: Quantized llms vs classical models,” in2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE, 2025, pp. 856–863

  20. [20]

    Llm-based threat detec- tion and prevention framework for iot ecosystems,

    Y . Otoum, A. Asad, and A. Nayak, “Llm-based threat detec- tion and prevention framework for iot ecosystems,”arXiv preprint arXiv:2505.00240, 2025

  21. [22]

    Qwen2.5-Coder Technical Report

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y . Fan, Y . Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y . Miao, S. Quan, Y . Feng, X. Ren, X. Ren, J. Zhou, and J. Lin, “Qwen2.5-coder technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2409.12186

  22. [23]

    Word embeddings: A survey,

    F. Almeida and G. Xex ´eo, “Word embeddings: A survey,”arXiv preprint arXiv:1901.09069, 2019

  23. [24]

    Compare encoder-decoder, encoder-only, and decoder-only architectures for text generation on low-resource datasets,

    P.-X. Cai, Y .-C. Fan, and F.-Y . Leu, “Compare encoder-decoder, encoder-only, and decoder-only architectures for text generation on low-resource datasets,” inInternational Conference on Broadband and Wireless Computing, Communication and Applications. Springer, 2021, pp. 216–225

  24. [25]

    Block-fedl: Electric vehicle charging load forecasting using federated learning and blockchain,

    S. M. Danish, A. Hameed, A. Ranjha, G. Srivastava, and K. Zhang, “Block-fedl: Electric vehicle charging load forecasting using federated learning and blockchain,”IEEE Transactions on Vehicular Technology, vol. 74, no. 2, pp. 2048–2056, 2024

  25. [26]

    Toward qos prediction based on temporal transformers for iot applications,

    A. Hameed, J. Violos, A. Leivadeas, N. Santi, R. Gr ¨unblatt, and N. Mitton, “Toward qos prediction based on temporal transformers for iot applications,”IEEE Transactions on Network and Service Management, vol. 19, no. 4, pp. 4010–4027, 2022

  26. [27]

    A neural probabilistic language model,

    Y . Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,”Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003. [Online]. Available: https://www.jmlr.org/papers/v3/bengio03a.html

  27. [28]

    Backpropagation through time: What it does and how to do it,

    P. J. Werbos, “Backpropagation through time: What it does and how to do it,”Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990

  28. [29]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learning Representations (ICLR),

  29. [30]

    Decoupled Weight Decay Regularization

    [Online]. Available: https://arxiv.org/abs/1711.05101

  30. [31]

    Sequence to sequence learning with neural networks,

    I. Sutskever, O. Vinyals, and Q. V . Le, “Sequence to sequence learning with neural networks,” inAdvances in Neural Information Processing Systems, vol. 27, 2014, pp. 3104–3112. [Online]. Available: https://papers.neurips.cc/paper/ 5346-sequence-to-sequence-learning-with-neural-networks.pdf

  31. [32]

    Small language models: Survey, measurements, and insights,

    Z. Lu, X. Li, D. Cai, R. Yi, F. Liu, X. Zhang, N. D. Lane, and M. Xu, “Small language models: Survey, measurements, and insights,”arXiv preprint arXiv:2409.15790, 2024

  32. [33]

    Advances in small language mod- els: A comprehensive survey on efficient nlp solutions for resource- constrained environments,

    A. Kshatriya and K. D. Prajapati, “Advances in small language mod- els: A comprehensive survey on efficient nlp solutions for resource- constrained environments,”Authorea Preprints, 2025

  33. [34]

    Knowledge distillation: A survey,

    J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,”International journal of computer vision, vol. 129, no. 6, pp. 1789–1819, 2021

  34. [35]

    CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,

    Y . Wang, W. Wang, S. Joty, and S. C. H. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: Association ...

  35. [36]

    Dy- namollm: Designing llm inference clusters for performance and energy efficiency,

    J. Stojkovic, C. Zhang, ´I. Goiri, J. Torrellas, and E. Choukse, “Dy- namollm: Designing llm inference clusters for performance and energy efficiency,” in2025 IEEE International Symposium on High Perfor- mance Computer Architecture (HPCA). IEEE, 2025, pp. 1348–1362

  36. [37]

    Beyond the limits: A survey of techniques to ex- tend the context length in large language models,

    X. Wang, M. Salmani, P. Omidi, X. Ren, M. Rezagholizadeh, and A. Eshaghi, “Beyond the limits: A survey of techniques to ex- tend the context length in large language models,”arXiv preprint arXiv:2402.02244, 2024

  37. [38]

    Samsi, et al

    S. Samsi, D. Zhao, J. McDonald, B. Li, A. Michaleas, M. Jones, W. Bergeron, J. Kepner, D. Tiwari, and V . Gadepally, “From words to watts: Benchmarking the energy costs of large language model inference,” 2023. [Online]. Available: https://arxiv.org/abs/2310.03003

  38. [39]

    When scaling meets llm finetuning: The effect of data, model and finetuning method,

    B. Zhang, Z. Liu, C. Cherry, and O. Firat, “When scaling meets llm finetuning: The effect of data, model and finetuning method,”arXiv preprint arXiv:2402.17193, 2024

  39. [40]

    Tokenpowerbench: Benchmarking the power consumption of llm inference,

    C. Niu, W. Zhang, J. Li, Y . Zhao, T. Wang, X. Wang, and Y . Chen, “Tokenpowerbench: Benchmarking the power consumption of llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2512.03024

  40. [41]

    Fine tuning llm for en- terprise: Practical guidelines and recommendations,

    K. VM, H. Warrier, Y . Guptaet al., “Fine tuning llm for en- terprise: Practical guidelines and recommendations,”arXiv preprint arXiv:2404.10779, 2024

  41. [42]

    Energy and policy considerations for deep learning in NLP,

    E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. M `arquez, Eds. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 3645–3650. [Online]. Available: https://acl...

  42. [43]

    Efficient prompting for llm-based generative internet of things,

    B. Xiao, B. Kantarci, J. Kang, D. Niyato, and M. Guizani, “Efficient prompting for llm-based generative internet of things,”IEEE Internet of Things Journal, 2024

  43. [44]

    Prompt programming for large language models: Beyond the few-shot paradigm,

    L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” 2021. [Online]. Available: https://arxiv.org/abs/2102.07350

  44. [45]

    Prompt engineering in large language models,

    G. Marvin, N. Hellen, D. Jjingo, and J. Nakatumba-Nabende, “Prompt engineering in large language models,” inInternational conference on data intelligence and cognitive informatics. Springer, 2023, pp. 387– 402

  45. [46]

    Programming is hard-or at least it used to be: Educational opportunities and challenges of ai code generation,

    B. A. Becker, P. Denny, J. Finnie-Ansley, A. Luxton-Reilly, J. Prather, and E. A. Santos, “Programming is hard-or at least it used to be: Educational opportunities and challenges of ai code generation,” in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V . 1, 2023, pp. 500–506

  46. [47]

    In-ide code generation from natural language: Promise and challenges,

    F. F. Xu, B. Vasilescu, and G. Neubig, “In-ide code generation from natural language: Promise and challenges,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–47, 2022

  47. [48]

    Huynh, B

    N. Huynh and B. Lin, “Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,”arXiv preprint arXiv:2503.01245, 2025

  48. [49]

    A survey on code generation with llm-based agents,

    Y . Dong, X. Jiang, J. Qian, T. Wang, K. Zhang, Z. Jin, and G. Li, “A survey on code generation with llm-based agents,”arXiv preprint arXiv:2508.00083, 2025

  49. [50]

    Learning from examples to improve code completion systems,

    M. Bruch, M. Monperrus, and M. Mezini, “Learning from examples to improve code completion systems,” inProceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, 2009, pp. 213–222

  50. [51]

    Self-organized agents: A llm multi- agent framework toward ultra large-scale code generation and opti- mization,

    Y . Ishibashi and Y . Nishimura, “Self-organized agents: A llm multi- agent framework toward ultra large-scale code generation and opti- mization,”arXiv preprint arXiv:2404.02183, 2024

  51. [52]

    Code refactoring with llm: A comprehensive evaluation with few-shot settings,

    M. R. Tapader, M. M. Rahman, A. I. Shiplu, M. F. I. Amin, and Y . Watanobe, “Code refactoring with llm: A comprehensive evaluation with few-shot settings,”arXiv preprint arXiv:2511.21788, 2025

  52. [53]

    Llm-based code generation: A systematic literature review with technical and demographic insights,

    K. U. Danyaro, M. Nasser, A. Zakari, S. Abdullahi, A. Khanzada, M. M. Yakubu, S. Shoaibet al., “Llm-based code generation: A systematic literature review with technical and demographic insights,” IEEE Access, vol. 13, pp. 194 915–194 939, 2025

  53. [54]

    A systematic review about large language models (llms) applied to code generation,

    F. A. Bacin, B. A. de Mello, G. D. Salton, and S. da Silva Feitosa, “A systematic review about large language models (llms) applied to code generation,”Revista Brasileira de Computac ¸˜ao Aplicada, vol. 17, no. 3, pp. 1–13, 2025

  54. [55]

    Usage of large language model for code generation tasks: A review,

    S. Bistarelli, M. Fiore, I. Mercanti, and M. Mongiello, “Usage of large language model for code generation tasks: A review,”SN Computer Science, vol. 6, no. 6, p. 673, 2025

  55. [56]

    Automatic code generation techniques: A systematic literature review,

    M. Alharbi and M. Alshayeb, “Automatic code generation techniques: A systematic literature review,”Automated Software Engineering, vol. 33, no. 1, p. 4, 2026

  56. [57]

    A Survey on Large Language Models for Code Generation

    J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A survey on large language models for code generation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.00515

  57. [58]

    Large language models meet NL2Code: A survey,

    D. Zan, B. Chen, F. Zhang, D. Lu, B. Wu, B. Guan, Y . Wang, and J.-G. Lou, “Large language models meet NL2Code: A survey,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, ...

  58. [59]

    A survey on evaluating large language models in code generation tasks,

    L. Chen, Q. Guo, H. Jia, Z. Zeng, X. Wang, Y . Xu, J. Wu, Y . Wang, Q. Gao, J. Wang, W. Ye, and S. Zhang, “A survey on evaluating large language models in code generation tasks,” 2024. [Online]. Available: https://arxiv.org/abs/2408.16498

  59. [60]

    A systematic survey on large language models for code generation,

    S. K. Jabrw and Q. I. Sarhan, “A systematic survey on large language models for code generation,”ARO - The Scientific Journal of Koya University, vol. 13, no. 2, pp. 83–99, 2025. [Online]. Available: https://aro.koyauniversity.org/index.php/aro/article/view/2159

  60. [61]

    Large language models for software engi- neering: A systematic literature review,

    X. Hou, Y . Zhao, Y . Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–79, 2024

  61. [62]

    Large language models for software engineering: A systematic literature review,

    ——, “Large language models for software engineering: A systematic literature review,” 2023. [Online]. Available: https://arxiv.org/abs/2308. 10620

  62. [63]

    Large language models for software engineering: Survey and open problems,

    A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” 2023. [Online]. Available: https://arxiv.org/abs/2310.03533

  63. [64]

    Large language models for code completion: A systematic literature review,

    R. A. Husein, H. Aburajouh, and C. Catal, “Large language models for code completion: A systematic literature review,”Computer Standards & Interfaces, vol. 92, p. 103917, 2025

  64. [65]

    ——, “Large language models for code completion: A systematic literature review,” Computer Standards & Interfaces, vol. 92, p. 103917, Mar. 2025. [Online]. Available: https://doi.org/10.1016/j.csi.2024.103917

  65. [67]

    B. Yang, Z. Cai, F. Liu, B. Le, L. Zhang, T. F. Bissyandé, Y. Liu, and H. Tian, “A survey of LLM-based automated program repair: Taxonomies, design paradigms, and applications,” 2025. [Online]. Available: https://arxiv.org/abs/2506.23749

  66. [68]

    Q. Zhang, C. Fang, Y. Xie, Y. Ma, W. Sun, Y. Yang, and Z. Chen, “A systematic literature review on large language models for automated program repair,” 2024. [Online]. Available: https://arxiv.org/abs/2405.01466

  67. [69]

    S. Afrin, M. Z. Haque, and A. Mastropaolo, “A systematic literature review of parameter-efficient fine-tuning for large code models,” ACM Transactions on Software Engineering and Methodology, 2025

  68. [70]

    M. Z. Haque, S. Afrin, and A. Mastropaolo, “A systematic literature review of parameter-efficient fine-tuning for large code models,” 2025. [Online]. Available: https://arxiv.org/abs/2504.21569

  69. [71]

    S. Joel, J. Wu, and F. Fard, “A survey on LLM-based code generation for low-resource and domain-specific programming languages,” ACM Transactions on Software Engineering and Methodology, 2024

  70. [72]

    X. Gu et al., “On the effectiveness of large language models in domain-specific code generation,” ACM Transactions on Software Engineering and Methodology, 2025

  71. [73]

    A. F. Pereira and R. F. Mello, “A systematic literature review on large language models applications in computer programming teaching evaluation process,” IEEE Access, 2025

  72. [74]

    D. Cambaz and X. Zhang, “Use of AI-driven code generation models in teaching and learning programming: A systematic literature review,” in Proceedings of the ACM SIGCSE Technical Symposium. ACM, 2024

  73. [75]

    “Large language models in computer science education: A systematic literature review,” ACM Transactions on Computing Education, 2025

  74. [76]

    F. J. Agbo et al., “Computing education using generative artificial intelligence: A systematic literature review,” Computers and Education: Artificial Intelligence, 2025

  75. [77]

    A. Mohamed, M. Assi, and M. Guizani, “The impact of LLM-assistants on software developer productivity: A systematic literature review,” arXiv preprint arXiv:2507.03156, 2025

  76. [78]

    S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, “The impact of AI on developer productivity: Evidence from GitHub Copilot,” 2023. [Online]. Available: https://arxiv.org/abs/2302.06590

  77. [79]

    V. F. Quevedo-Tumailli, S. E. Arias Calderón, V. A. Ortega Manjarrez, and B. Ortega-Tenezaca, “Impact of large language models on quality and efficiency of code generation: Systematic literature review,” Revista Digital Novasinergia, vol. 8, pp. 52–66, 2025

  78. [80]

    D. G. Paul, H. Zhu, and I. Bayley, “Benchmarks and metrics for evaluations of code generation: A critical review,” in 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), 2024, pp. 87–94. [Online]. Available: https://arxiv.org/abs/2406.12655

  79. [82]

    X. Sun, D. Ståhl, K. Sandahl, and C. Kessler, “Quality assurance of LLM-generated code: Addressing non-functional quality characteristics,”

  80. [83]

Showing first 80 references.