pith. machine review for the scientific record.

arxiv: 2604.15385 · v1 · submitted 2026-04-16 · 💻 cs.SE · cs.LG


Prompt-Driven Code Summarization: A Systematic Literature Review


Pith reviewed 2026-05-10 11:30 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords code summarization · prompt engineering · large language models · systematic literature review · software documentation · LLM prompting · evaluation metrics

The pith

A review of prompting techniques shows LLMs can generate better code summaries, but optimal strategies and evaluations remain unclear across studies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a systematic literature review to consolidate research on using large language models to create natural language summaries from source code. It notes that prompt design is critical to LLM success in this task, with approaches like few-shot examples, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning offering improvements for software documentation. Yet the studies are scattered, leaving open questions about which techniques perform best for given models or conditions. Evaluation often relies on simple overlap metrics that miss deeper semantic accuracy, and the review maps these patterns while highlighting gaps for future work on reliable automated documentation.

Core claim

The central claim is that prompting paradigms such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning demonstrate promise for enhancing LLM performance on code summarization, yet existing research remains fragmented with limited insight into the best strategies for specific models and contexts, and most evaluations depend on overlap-based metrics that may fail to reflect semantic quality.
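To make the overlap-metric concern concrete, here is a minimal sketch (illustrative only, not a metric from the paper): a unigram-overlap F1, standing in for BLEU/ROUGE-style scores, rewards a verbatim match perfectly while penalizing a semantically equivalent paraphrase.

```python
# Illustrative sketch: why overlap-based metrics can miss semantic quality.
# Two summaries that mean the same thing share few surface tokens, so a
# simple unigram-overlap F1 scores the paraphrase poorly.

def overlap_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, a crude stand-in for BLEU/ROUGE-style metrics."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    common = len(ref & cand)
    if common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

reference  = "returns the maximum value in the input list"
verbatim   = "returns the maximum value in the input list"
paraphrase = "finds the largest element of the given sequence"  # same meaning

print(overlap_f1(reference, verbatim))    # 1.0: exact match scores perfectly
print(overlap_f1(reference, paraphrase))  # ~0.14: only "the" overlaps
```

A metric that captured semantic quality would rate both candidates near-equal; the gap between 1.0 and ~0.14 is exactly the evaluation weakness the review highlights.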

What carries the argument

Categorization of prompting paradigms (few-shot, zero-shot, chain-of-thought, retrieval-augmented) combined with cross-study analysis of their effectiveness and evaluation practices in LLM-driven code summarization.
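As a concrete illustration of how these paradigms differ at the prompt level, the sketch below assembles a prompt for each (templates are assumed for illustration, not taken from the surveyed studies): the model and task stay fixed; only the prompt construction changes.

```python
# Illustrative prompt templates (assumed for this sketch, not drawn from the
# reviewed papers): the four paradigms differ only in how the input is built.

CODE = "def add(a, b):\n    return a + b"

def zero_shot(code: str) -> str:
    # No examples: the model relies entirely on the instruction.
    return f"Summarize this function in one sentence:\n{code}"

def few_shot(code: str, examples: list[tuple[str, str]]) -> str:
    # Prepend worked (code, summary) pairs before the query.
    shots = "\n\n".join(f"Code:\n{c}\nSummary: {s}" for c, s in examples)
    return f"{shots}\n\nCode:\n{code}\nSummary:"

def chain_of_thought(code: str) -> str:
    # Ask for intermediate reasoning before the final summary.
    return ("Reason step by step about the function's inputs, control flow, "
            f"and return value, then summarize it:\n{code}")

def retrieval_augmented(code: str, retrieved: list[str]) -> str:
    # Inject snippets retrieved from a corpus of similar code as context.
    context = "\n".join(retrieved)
    return f"Context from similar code:\n{context}\n\nSummarize:\n{code}"

prompt = few_shot(CODE, [("def neg(x):\n    return -x", "Negates a number.")])
print(prompt.endswith("Summary:"))  # True: the model completes the summary
```

Framing the paradigms as interchangeable prompt builders is what makes cross-study comparison possible in principle; the review's point is that such controlled comparisons are rarely run.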

If this is right

  • Researchers need targeted experiments to identify conditions under which particular prompting strategies outperform others for different models.
  • Development of metrics that capture semantic quality beyond simple overlap measures would improve assessment of summary usefulness.
  • Clearer guidelines on prompt design could support more consistent integration of automated summarization into developer workflows.
  • Filling identified gaps would reduce reliance on manual documentation and aid tasks such as code review and maintenance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If prompting best practices become standardized, integrated tools in development environments could produce more reliable documentation at scale.
  • Better code summaries from optimized prompts may indirectly improve accuracy in downstream applications like defect localization or commit message generation.
  • The current fragmentation points to a need for shared benchmark datasets that test prompting strategies across varied codebases and languages.

Load-bearing premise

The collected studies form a representative sample of the field, and the chosen categorization of prompting paradigms accurately reflects the underlying technical distinctions without significant selection or reporting bias.

What would settle it

A new comprehensive survey that incorporated overlooked recent papers and found either consistent superiority of one prompting method across models, or evaluation results that align closely on semantic quality, would contradict the paper's fragmentation and limited-understanding conclusions.

Figures

Figures reproduced from arXiv: 2604.15385 by Afia Farjana, Antonio Mastropaolo, Zaiyu Cheng.

Figure 1: Evolution timeline of code summarization paradigms (2010–2025), based on Zhu & Pan [7].
Figure 2: Study selection process for the systematic review on prompt-based code summarization. The pipeline includes four stages: (i) identification…
Figure 3: Publication Year distribution of the initial study corpus. For studies without author-supplied keywords, manual annotation was performed using title, abstract, and methodological content, following the guidelines of [77]. This process yielded 4 candidate papers for manual analysis. Both the Python script and the complete prompt-engineering terms used in our search are provided in our replication package.
Figure 4: Venue Distribution of Prompt Engineering Techniques in Code Summarization.
Figure 5: A Taxonomy of Prompt Engineering Techniques for Code Summarization Across Granularity Levels and Prompt Paradigms.
Figure 6: Trend of LLM family adoption across code-summarization studies (2020–2025). The trend highlights a gradual diversification of model…
Figure 7: Distribution of programming languages across 29 prompt-based code summarization studies. Each ring represents a prompting paradigm…
Figure 8: Artifact-sharing landscape across 29 studies. Preprocessing and training scripts are the most frequently shared artifacts, followed by…
Figure 9: Year-wise availability of replication packages in primary studies (2020–2025).
Original abstract

Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering, the design of input prompts to guide model behavior, a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. This paper presents a systematic literature review on prompt-driven code summarization with large language models. It consolidates evidence on prompting paradigms (few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning), examines their effectiveness for generating natural language code descriptions, notes that research remains fragmented with limited understanding of optimal strategies/models/conditions, critiques reliance on overlap-based evaluation metrics, and identifies gaps to guide future work.

Significance. If methodologically rigorous, this SLR would be a useful consolidation in an active area of LLM applications for software engineering. It would help clarify promising prompting directions, surface evaluation weaknesses, and reduce fragmentation by mapping what is known about prompt effectiveness for code summarization tasks that support comprehension, maintenance, and downstream activities.

major comments (1)
  1. [Methodology section] The central synthesis, that prompting approaches show promise yet research is fragmented with limited understanding of best strategies, depends on the collected studies forming a representative sample and on the chosen categorization (few-shot, CoT, RAG, zero-shot) accurately reflecting technical distinctions without selection or reporting bias. The abstract outlines scope but provides no explicit search protocol, databases, inclusion/exclusion criteria, or quality assessment details, leaving these load-bearing elements unverified and risking incomplete coverage or forced groupings that undermine the gap analysis.
minor comments (1)
  1. [Abstract] The claim that most studies rely on overlap-based metrics would be strengthened by stating the total number of primary studies reviewed and the covered time period.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our systematic literature review. We address the major comment on methodology below and will revise the manuscript accordingly to improve transparency and verifiability.

Point-by-point responses
  1. Referee: [Methodology section] The central synthesis, that prompting approaches show promise yet research is fragmented with limited understanding of best strategies, depends on the collected studies forming a representative sample and on the chosen categorization (few-shot, CoT, RAG, zero-shot) accurately reflecting technical distinctions without selection or reporting bias. The abstract outlines scope but provides no explicit search protocol, databases, inclusion/exclusion criteria, or quality assessment details, leaving these load-bearing elements unverified and risking incomplete coverage or forced groupings that undermine the gap analysis.

    Authors: We agree that the abstract does not explicitly detail the search protocol, databases, inclusion/exclusion criteria, or quality assessment, which limits immediate verifiability for readers. The full manuscript contains a dedicated Methodology section (Section 3) that outlines a structured search across IEEE Xplore, ACM Digital Library, ScienceDirect, SpringerLink, and arXiv using predefined keyword strings (detailed in Appendix A), with inclusion criteria limited to empirical studies (2020 onward) evaluating LLM prompting for code summarization and exclusion of non-empirical or non-English works. Quality assessment used a modified Kitchenham checklist with reported inter-rater reliability. The four-category taxonomy was derived from the primary technique reported in each primary study, with dual-author independent coding and consensus resolution to mitigate bias. However, we acknowledge the referee's point that these elements could be presented more explicitly and with a PRISMA diagram to strengthen the synthesis. We will revise the abstract to include a concise methodology summary and expand Section 3 with additional justification for the categorization scheme and search coverage. revision: yes
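The rebuttal's appeal to inter-rater reliability for the dual-author coding can be made concrete with a standard agreement statistic. The sketch below computes Cohen's kappa over hypothetical paradigm labels (the labels and counts are invented for illustration; the paper's actual coding data is not reproduced here).

```python
# Illustrative sketch: Cohen's kappa for dual-author independent coding,
# the kind of inter-rater reliability check the rebuttal describes.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical paradigm assignments for six studies by two coders.
a = ["few-shot", "zero-shot", "cot", "rag", "few-shot", "cot"]
b = ["few-shot", "zero-shot", "cot", "few-shot", "few-shot", "cot"]
print(round(cohens_kappa(a, b), 3))  # 0.76: substantial agreement
```

Reporting a value like this alongside the taxonomy, as the rebuttal promises, is what would let readers judge whether the four-category scheme is coder-independent.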

Circularity Check

0 steps flagged

No circularity: SLR aggregates external studies without self-referential derivations

Full rationale

This systematic literature review consolidates evidence from external papers on prompting strategies for code summarization. It contains no mathematical derivations, fitted parameters, predictions, or uniqueness theorems that reduce to the paper's own inputs by construction. Claims about fragmentation and promising approaches (few-shot, CoT, RAG, zero-shot) are synthesized from cited literature rather than defined or forced by the review's own categorization or search process. No self-citation chains or ansatzes are load-bearing for the central synthesis. The paper is self-contained against external benchmarks as a review.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities. The review rests on the domain assumption that established systematic literature review procedures can reliably map a fragmented research area.

axioms (1)
  • Domain assumption: Standard systematic literature review methodology (search strategy, inclusion criteria, quality assessment) is sufficient to consolidate evidence without major bias.
    Invoked implicitly by the decision to perform an SLR and to draw conclusions about effectiveness and gaps.

pith-pipeline@v0.9.0 · 5508 in / 1256 out tokens · 91559 ms · 2026-05-10T11:30:51.383645+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

117 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

  2. [2]

    Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

  3. [3]

    An analysis of the automatic bug fixing performance of chatgpt

    Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. An analysis of the automatic bug fixing performance of chatgpt. In2023 IEEE/ACM International Workshop on Automated Program Repair (APR), pages 23–30. IEEE, 2023

  4. [4]

    Program translation via code distillation.arXiv preprint arXiv:2310.11476, 2023

    Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, and Neel Sundaresan. Program translation via code distillation.arXiv preprint arXiv:2310.11476, 2023

  5. [5]

    A systematic literature review on the use of deep learning in software engineering research.ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–58, 2022

    Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. A systematic literature review on the use of deep learning in software engineering research.ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–58, 2022

  6. [6]

    Automatic generation of natural language summaries for java classes

    Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. Automatic generation of natural language summaries for java classes. In2013 21st International conference on program comprehension (ICPC), pages 23–32. IEEE, 2013

  7. [7]

    Automatic code summarization: A systematic literature review.arXiv preprint arXiv:1909.04352, 2019

    Yuxiang Zhu and Minxue Pan. Automatic code summarization: A systematic literature review.arXiv preprint arXiv:1909.04352, 2019

  8. [8]

    A convolutional attention network for extreme summarization of source code

    Miltiadis Allamanis, Hao Peng, and Charles Sutton. A convolutional attention network for extreme summarization of source code. InInternational conference on machine learning, pages 2091–2100. PMLR, 2016

  9. [9]

    Deep learning-based code reviews: A paradigm shift or a double-edged sword?arXiv preprint arXiv:2411.11401, 2024

    Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Sonia Haiduc, Gabriele Bavota, et al. Deep learning-based code reviews: A paradigm shift or a double-edged sword?arXiv preprint arXiv:2411.11401, 2024

  10. [10]

    Toward deep learning software repositories

    Martin White, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. Toward deep learning software repositories. In2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pages 334–345. IEEE, 2015

  11. [11]

    Automatically documenting program changes

    Raymond PL Buse and Westley R Weimer. Automatically documenting program changes. InProceedings of the 25th IEEE/ACM international conference on automated software engineering, pages 33–42, 2010

  12. [12]

    Simulated analysis and hardware implementation of voiceband circular microphone array

    M Sami Zitouni, M Luai Hammadih, Abdulla AlShehhi, Saif AlKindi, Nazar Ali, and Luis Weruaga. Simulated analysis and hardware implementation of voiceband circular microphone array. In2013 IEEE 20th International Conference on Electronics, Circuits, and Systems (ICECS), pages 508–511. IEEE, 2013

  13. [13]

    Commit message matters: Investigating impact and evolution of commit message quality

    Jiawei Li and Iftekhar Ahmed. Commit message matters: Investigating impact and evolution of commit message quality. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 806–817. IEEE, 2023

  14. [14]

    Source code summarization in the era of large language models.arXiv preprint arXiv:2407.07959, 2024

    Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. Source code summarization in the era of large language models.arXiv preprint arXiv:2407.07959, 2024

  15. [15]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  16. [16]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  17. [17]

    From llm to nmt: Advancing low-resource machine translation with claude.arXiv preprint arXiv:2404.13813, 2024

    Maxim Enis and Mark Hopkins. From llm to nmt: Advancing low-resource machine translation with claude.arXiv preprint arXiv:2404.13813, 2024

  18. [18]

    Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? InProceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 2300–2344, 2022

  19. [19]

    Sheng Lu, Hendrik Schuff, and Iryna Gurevych. How are prompts different in terms of sensitivity? InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5833–5856, 2024

  20. [20]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. InThe eleventh international conference on learning representations, 2022

  21. [21]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  22. [22]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  23. [23]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, pages 1877–1901, 2020

  24. [24]

    Retrieval augmented code generation and summarization

    Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Retrieval augmented code generation and summarization.arXiv preprint arXiv:2108.11601, 2021

  25. [25]

    Key challenges in prompt engineering

    Vladimir Geroimenko. Key challenges in prompt engineering. InThe Essential Guide to Prompt Engineering: Key Principles, Techniques, Challenges, and Security Risks, pages 85–102. Springer, 2025

  26. [26]

    Unleashing the potential of prompt engineering for large language models.Patterns, 2025

    Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. Unleashing the potential of prompt engineering for large language models.Patterns, 2025

  27. [27]

    Reassessing automatic evaluation metrics for code summarization tasks

    Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. Reassessing automatic evaluation metrics for code summarization tasks. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1105–1116, 2021

  28. [28]

    Semantic similarity metrics for evaluating source code summarization

    Sakib Haque, Zachary Eberhart, Aakash Bansal, and Collin McMillan. Semantic similarity metrics for evaluating source code summarization. InProceedings of the 30th IEEE/ACM International Conference on Program Comprehension, pages 36–47, 2022

  29. [29]

    Evaluating code summarization techniques: A new metric and an empirical characterization

    Antonio Mastropaolo, Matteo Ciniselli, Massimiliano Di Penta, and Gabriele Bavota. Evaluating code summarization techniques: A new metric and an empirical characterization. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024

  30. [30]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155, 2020

  31. [31]

    Graphcodebert: Pre-training code representations with data flow

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366, 2020

  32. [32]

    Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation.arXiv preprint arXiv:2109.00859, 2021

  33. [33]

    Unified pre-training for program understanding and generation

    Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333, 2021

  34. [34]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196, 2024

  35. [35]

    A systematic evaluation of large language models of code

    Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. A systematic evaluation of large language models of code. InProceedings of the 6th ACM SIGPLAN international symposium on machine programming, pages 1–10, 2022

  36. [36]

    A review of automatic source code summarization.Empirical Software Engineering, 29(6):162, 2024

    Xuejun Zhang, Xia Hou, Xiuming Qiao, and Wenfeng Song. A review of automatic source code summarization.Empirical Software Engineering, 29(6):162, 2024

  37. [37]

    A survey of large language models for code intelligence

    Zhu Zhang, Yue Wang, Daya Guo, Duyu Tang, Nan Duan, and Ming Zhou. A survey of large language models for code intelligence. 2022

  38. [38]

    Summarizing source code using a neural attention model

    Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Summarizing source code using a neural attention model. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2073–2083, 2016

  39. [39]

    Improved code summarization via a graph neural network

    Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. Improved code summarization via a graph neural network. InProceedings of the 28th international conference on program comprehension, pages 184–195, 2020

  40. [40]

    Transformer-based model for source code summarization

    Wasi Uddin Ahmad, Swarnendu Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Transformer-based model for source code summarization. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4998–5007, 2020

  41. [41]

    Retrieval-augmented generation for code summarization via hybrid gnn.arXiv preprint arXiv:2006.05405, 2020

    Shangqing Liu, Yu Chen, Xiaofei Xie, Jingkai Siow, and Yang Liu. Retrieval-augmented generation for code summarization via hybrid gnn.arXiv preprint arXiv:2006.05405, 2020

  42. [42]

    Enhanced prompting framework for code summarization with large language models.Proceedings of the ACM on Software Engineering, 2(ISSTA):1630–1653, 2025

    Minying Fang, Xing Yuan, Yuying Li, Haojie Li, Chunrong Fang, and Junwei Du. Enhanced prompting framework for code summarization with large language models.Proceedings of the ACM on Software Engineering, 2(ISSTA):1630–1653, 2025

  43. [43]

    Codex: A comprehensive knowledge graph completion benchmark.arXiv preprint arXiv:2009.07810, 2020

    Tara Safavi and Danai Koutra. Codex: A comprehensive knowledge graph completion benchmark.arXiv preprint arXiv:2009.07810, 2020

  44. [44]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161, 2023

  45. [45]

    Jieke Shi, Zhou Yang, and David Lo. Efficient and green large language models for software engineering: Literature review, vision, and the road ahead.ACM Transactions on Software Engineering and Methodology, 34(5):1–22, 2025

  46. [46]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  47. [47]

    Plbart: Pre-training language model for bi-directional attentive representation of code

    Wasi Uddin Ahmad, Swarnendu Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Plbart: Pre-training language model for bi-directional attentive representation of code. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

  48. [48]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. InJournal of Machine Learning Research, volume 21, pages 1–67, 2020

  49. [49]

    Deepcom: Deep comment generation for source code.Proceedings of the 26th IEEE/ACM International Conference on Program Comprehension (ICPC), pages 200–210, 2018

    Xiaopeng Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. Deepcom: Deep comment generation for source code.Proceedings of the 26th IEEE/ACM International Conference on Program Comprehension (ICPC), pages 200–210, 2018

  50. [50]

    A comprehensive study of code summarization with neural attention.Proceedings of the 28th IEEE/ACM International Conference on Program Comprehension (ICPC), pages 53–64, 2020

    Alexander LeClair and Collin McMillan. A comprehensive study of code summarization with neural attention.Proceedings of the 28th IEEE/ACM International Conference on Program Comprehension (ICPC), pages 53–64, 2020

  51. [51]

    Codexglue: A benchmark dataset and open challenge for code intelligence

    Shuai Lu, Duyu Tang, Nan Duan, Zhangyin Feng, et al. Codexglue: A benchmark dataset and open challenge for code intelligence. InAdvances in Neural Information Processing Systems, 2021

  52. [52]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  53. [53]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  54. [54]

    Finetuned language models are zero-shot learners

    Jason Wei et al. Finetuned language models are zero-shot learners. InInternational Conference on Learning Representations, 2021

  55. [55]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, and et al. Training language models to follow instructions with human feedback.arXiv preprint arXiv:2203.02155, 2022

  56. [56]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, et al. Self-instruct: Aligning language model with self generated instructions.arXiv preprint arXiv:2212.10560, 2022

  57. [57]

    Prompt engineering in large language models

    Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo, and Joyce Nakatumba-Nabende. Prompt engineering in large language models. InInternational conference on data intelligence and cognitive informatics, pages 387–402. Springer, 2023

  58. [58]

    Source code summarization & comment generation with nlp: A new index proposal

    M Alp Eren Kilic and M Fatih Adak. Source code summarization & comment generation with nlp: A new index proposal. In2024 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pages 1–6. IEEE, 2024

  59. [59]

    Do advanced language models eliminate the need for prompt engineering in software engineering?arXiv preprint arXiv:2411.02093, 2024

    Guoqing Wang, Zeyu Sun, Zhihao Gong, Sixiang Ye, Yizhou Chen, Yifan Zhao, Qingyuan Liang, and Dan Hao. Do advanced language models eliminate the need for prompt engineering in software engineering?arXiv preprint arXiv:2411.02093, 2024

  60. [60]

    Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. Codet5+: Open code large language models for code understanding and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1069–1088, Singapore, 2023. Association for Computational Linguistics

  61. [61]

    A prompt learning framework for source code summarization

    Weisong Sun, Chunrong Fang, Yudu You, Yuchen Chen, Yi Liu, Chong Wang, Jian Zhang, Quanjun Zhang, Hanwei Qian, Wei Zhao, et al. A prompt learning framework for source code summarization. arXiv preprint arXiv:2312.16066, 2023

  62. [62]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  63. [63]

    A survey of prompt engineering methods in large language models for different nlp tasks

    Shubham Vatsal and Harsh Dubey. A survey of prompt engineering methods in large language models for different nlp tasks. arXiv preprint arXiv:2407.12994, 2024

  64. [64]

    Raxcs: Towards cross-language code summarization with contrastive pre-training and retrieval augmentation

    Kaiyuan Yang, Junfeng Wang, and Zihua Song. Raxcs: Towards cross-language code summarization with contrastive pre-training and retrieval augmentation. Information and Software Technology, 183:107741, 2025

  65. [65]

    Retrieval-based neural source code summarization

    Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. Retrieval-based neural source code summarization. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 1385–1397, 2020

  66. [66]

    On the evaluation of neural code summarization

    Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, and Hongbin Sun. On the evaluation of neural code summarization. In Proceedings of the 44th International Conference on Software Engineering, pages 1597–1608, 2022

  67. [67]

    Summary generation for source code: A systematic literature review

    Bin Liu, Ming Wang, Shihai Huang, Zhe Jiang, and Yijun Yu. Summary generation for source code: A systematic literature review. Information and Software Technology, 158:107169, 2023

  68. [68]

    Automatic chain of thought prompting in large language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  69. [69]

    Neural architecture for source code summarization

    Alexander LeClair and Collin McMillan. Neural architecture for source code summarization. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 590–601. IEEE, 2019

  70. [70]

    Commenting higher-level code unit: Full code, reduced code, or hierarchical code summarization

    Weisong Sun, Yiran Zhang, Jie Zhu, Zhihui Wang, Chunrong Fang, Yonglong Zhang, Yebo Feng, Jiangping Huang, Xingya Wang, Zhi Jin, et al. Commenting higher-level code unit: Full code, reduced code, or hierarchical code summarization. arXiv preprint arXiv:2503.10737, 2025

  71. [71]

    Achieving high-level software component summarization via hierarchical chain-of-thought prompting and static code analysis

    Satrio Adi Rukmono, Lina Ochoa, and Michel RV Chaudron. Achieving high-level software component summarization via hierarchical chain-of-thought prompting and static code analysis. In 2023 IEEE International Conference on Data and Software Engineering (ICoDSE), pages 7–12. IEEE, 2023

  72. [72]

    Promptexp: Multi-granularity prompt explanation of large language models

    Ximing Dong, Shaowei Wang, Dayi Lin, Gopi Krishnan Rajbahadur, Boquan Zhou, Shichao Liu, and Ahmed E Hassan. Promptexp: Multi-granularity prompt explanation of large language models. arXiv preprint arXiv:2410.13073, 2024

  73. [73]

    Docagent: A multi-agent system for automated code documentation generation

    Dayu Yang, Antoine Simoulin, Xin Qian, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, and Grey Yang. Docagent: A multi-agent system for automated code documentation generation. arXiv preprint arXiv:2504.08725, 2025

  74. [74]

    Retrieve and refine: exemplar-based neural comment generation

    Bolin Wei, Yongmin Li, Ge Li, Xin Xia, and Zhi Jin. Retrieve and refine: exemplar-based neural comment generation. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 349–360, 2020

  75. [75]

    A survey of automatic generation of source code comments: Algorithms and techniques

    Xiaotao Song, Hailong Sun, Xu Wang, and Jiafei Yan. A survey of automatic generation of source code comments: Algorithms and techniques. IEEE Access, 7:111411–111428, 2019

  76. [76]

    Unlocking the potential of the prompt engineering paradigm in software engineering: A systematic literature review

    Irdina Wanda Syahputri, Eko K Budiardjo, and Panca O Hadi Putra. Unlocking the potential of the prompt engineering paradigm in software engineering: A systematic literature review. AI, 6(9):206, 2025

  77. [77]

    Guidelines for performing systematic literature reviews in software engineering

    Barbara Kitchenham and Stuart Charters. Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01, EBSE Technical Report, Keele University and University of Durham, 2007

  78. [78]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002

  79. [79]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004

  80. [80]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005

Showing first 80 references.