Leveraging Design-Aware Context in Large Language Models for Code Comment Generation

Anamitra Mukhopadhyay; Aritra Mitra; Partha Pratim Chakrabarti; Partha Pratim Das; Paul D Clough; Srijoni Majumdar

arxiv: 2510.22338 · v3 · submitted 2025-10-25 · 💻 cs.SE


Pith reviewed 2026-05-18 04:08 UTC · model grok-4.3

classification 💻 cs.SE

keywords code comment generation · large language models · design documents · software documentation · novice codebases · LLM prompting · software maintenance

The pith

Design documents can be used as context for large language models to generate more useful code comments than code alone allows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that design documents supply extra information that helps large language models produce clearer and more relevant comments for code. This approach targets the common problem of missing or low-quality comments in code written by novices who lack established standards. A sympathetic reader would care because such improved comments could shorten the time and effort needed to understand and maintain those codebases later on. The study tests the practical feasibility of supplying design documents to the models during comment generation.

Core claim

The authors argue that design documents contain purpose and structure details not directly visible in the source code, and that providing these documents as context allows large language models to generate comments that better support future maintenance and understanding, especially in novice-developed projects where commenting standards are absent.

What carries the argument

Design-aware context, meaning the inclusion of design documents in the input prompt supplied to large language models for the specific task of generating code comments.
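
As a rough illustration of what "design-aware context" could look like in practice, the sketch below builds two prompts for the same code, one with a design-document excerpt and one without. The function name `build_comment_prompt`, the prompt wording, and the sample code and design text are all hypothetical; the paper's actual prompt template is not reproduced on this page.

```python
from typing import Optional


def build_comment_prompt(code: str, design_doc: Optional[str] = None) -> str:
    """Assemble a comment-generation prompt, optionally with design context.

    Hypothetical helper for illustration; not the authors' template.
    """
    parts = ["You are documenting source code for future maintainers."]
    if design_doc:
        # Design-aware condition: place the design excerpt before the code.
        parts.append("Design document excerpt:\n" + design_doc.strip())
    parts.append("Source code:\n" + code.strip())
    parts.append("Write a concise comment explaining the purpose of this code.")
    return "\n\n".join(parts)


# Made-up inputs, used only to show the two conditions side by side.
code = "def total(xs):\n    return sum(x.price for x in xs)"
design = "The cart module computes order totals before tax is applied."

code_only_prompt = build_comment_prompt(code)
design_aware_prompt = build_comment_prompt(code, design)
```

Keeping the two conditions in one function means the only difference between the prompts is the design excerpt, which makes a later with/without comparison easier to interpret.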

If this is right

Generated comments align more closely with the original design intent rather than just describing surface-level code behavior.
Maintenance time decreases for codebases that previously had inadequate or missing comments.
Large language models become a practical tool for filling documentation gaps in amateur or student-written software.
Design documents gain a new role as direct inputs to automated documentation processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be extended by also supplying related artifacts such as requirements or test plans to further enrich the generated comments.
Integration into development environments might prompt users to attach design documents when requesting comment suggestions.
Teams could adopt lightweight design-document templates specifically to support automated comment improvement.

Load-bearing premise

Design documents are routinely available, hold information that is both relevant and not already obvious from the code, and current models can reliably extract and apply that information to improve comment quality.

What would settle it

A direct comparison of comment quality ratings or developer comprehension times for the same code, once with design-document context and once without, that shows no measurable gain would disprove the central claim.
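
One way such a paired comparison might be summarized is sketched below: each code file is rated once with design context and once without, and the paired differences are checked with an exact sign test. The ratings are invented for illustration and the choice of a sign test is an assumption, not the paper's evaluation protocol.

```python
from math import comb
from statistics import mean


def paired_summary(with_doc, without_doc):
    """Return the mean paired difference and a two-sided exact sign-test p-value."""
    diffs = [a - b for a, b in zip(with_doc, without_doc)]
    nonzero = [d for d in diffs if d != 0]  # ties carry no sign information
    n = len(nonzero)
    wins = sum(d > 0 for d in nonzero)
    # Probability of a split at least this lopsided under a fair coin.
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return mean(diffs), min(1.0, 2 * tail)


# Hypothetical 1-5 usefulness ratings for the same six files.
with_doc = [4, 5, 3, 4, 5, 4]
without_doc = [3, 4, 3, 2, 4, 3]

mean_diff, p_value = paired_summary(with_doc, without_doc)
```

A mean difference near zero with a large p-value would be the "no measurable gain" outcome described above; a real study would also need enough files and raters for the test to have power.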

Figures

Figures reproduced from arXiv: 2510.22338 by Anamitra Mukhopadhyay, Aritra Mitra, Partha Pratim Chakrabarti, Partha Pratim Das, Paul D Clough, Srijoni Majumdar.

**Figure 1.** Brighter points are with the design documents in context, and dimmer points are without them. For every LLM, the …

**Figure 2.** Decrease in completeness with increasing file size, …

Original abstract

Comments are very useful to the flow of code development. With the increasing commonality of code, novice coders have been creating a significant amount of codebases. Due to lack of commenting standards, their comments are often useless, and increase the time taken to further maintain codes. This study intends to find the usefulness of large language models (LLMs) in these cases to generate potentially better comments. This study focuses on the feasibility of design documents as a context for the LLMs to generate more useful comments, as design documents are often used by maintainers to understand code when comments do not suffice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a preliminary feasibility sketch for using design documents as extra context in LLMs for code comment generation, but it supplies no methods, baselines, or results to test the claim.

The main point is that the authors want to feed design documents into LLMs so the models can produce more useful comments for code written by novices who skip proper documentation. They note that maintainers already turn to design docs when comments are missing, and they frame the work as a check on whether this extra context helps LLMs do better than code alone. That is the entire contribution so far: a stated intention to run the test, not the test itself or any findings from it.

The idea is a direct extension of existing LLM code tasks rather than a new technique. It does pick out a practical pain point in student and small-team codebases where comments are often absent or unhelpful, and it correctly observes that design documents carry higher-level intent that raw code does not. That observation is fair and grounded in how real maintenance works.

The gaps are straightforward and central. The description gives no account of the code and document pairs they would use, no prompting template, no code-only baseline, and no way to measure whether comments actually improve in usefulness or accuracy. Without those pieces, any claim that design documents are the reason for better output cannot be checked. The assumption that LLMs will reliably pull relevant, non-redundant information from the documents also sits unexamined.

This kind of short note might interest people already working on LLM tooling for software documentation and maintainability. A reader hunting for reproducible experiments or a clear advance in method will find little to use. The work is not yet at the stage where a serious referee would get much out of it; the authors would need to add the evaluation design and at least preliminary outcomes before it makes sense to send for review.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that design documents can be leveraged as context for LLMs to generate more useful code comments than those produced from code alone, addressing the lack of commenting standards in novice codebases. It positions the work as a feasibility study for this design-aware approach.

Significance. If the central claim holds with rigorous evidence, the result would be significant for software engineering practice: it could provide a low-cost way to improve documentation quality in educational, open-source, and novice-maintained codebases by exploiting design artifacts that are often already available to maintainers.

major comments (1)

[Abstract] Abstract: the manuscript states only the intention to study feasibility and supplies no dataset construction details, prompting template, code-only baseline condition, evaluation protocol (human ratings, automated metrics, or statistical test), or results. This absence is load-bearing for the central claim that design documents produce measurably better comments attributable to the design information rather than prompt length or generic LLM behavior.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract should be expanded to better substantiate the feasibility study's methodology and results, and we will revise it accordingly.

Referee: [Abstract] Abstract: the manuscript states only the intention to study feasibility and supplies no dataset construction details, prompting template, code-only baseline condition, evaluation protocol (human ratings, automated metrics, or statistical test), or results. This absence is load-bearing for the central claim that design documents produce measurably better comments attributable to the design information rather than prompt length or generic LLM behavior.

Authors: We acknowledge that the abstract is intentionally concise and high-level, which has led to the omission of these details. The full manuscript describes the dataset of novice-created codebases paired with available design documents, the prompting templates (with controls for length and structure between the design-aware and code-only conditions), the evaluation protocol combining expert human ratings on usefulness and clarity with automated metrics and statistical tests, and preliminary results indicating benefits from the design context. We will revise the abstract to include a brief summary of the dataset, baseline, prompting approach, evaluation methods, and key findings to strengthen support for the central claim and address potential confounds such as prompt length.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical feasibility study with no derivations, fitted parameters, or self-referential claims.

The paper is a proposal to empirically test whether design-document context improves LLM-generated code comments. The abstract and described content contain no equations, no fitted quantities, no predictions that reduce to inputs by construction, and no load-bearing self-citations or uniqueness theorems. The central claim is framed as an intended experiment rather than a result derived from prior outputs of the same work. No step in the described chain equates a claimed output to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or formal assumptions. The implicit premise is that design documents exist and are usable as LLM context, but this is not formalized.

pith-pipeline@v0.9.0 · 5645 in / 1012 out tokens · 25602 ms · 2026-05-18T04:08:18.456261+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We use Retrieval-Augmented Generation (RAG) to generate comments from the code where design documents are used as a source for retrieval... We evaluated four generation setups: (i) Few-shot prompting... (iii) Few-shot prompting with RAG on the design document

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We have seen a 35% decrease in bug-fixing time for LLM generated comments when the design document is used.
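
The quoted passage mentions retrieval over the design document. As a toy illustration of that retrieval step only, the sketch below ranks design-document chunks by bag-of-words cosine similarity to a code snippet. The similarity measure, the helper names (`tokenize`, `cosine`, `retrieve`), and the sample chunks are assumptions made for this example; the page does not say which retriever or embedding model the authors used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; snake_case identifiers split on underscores."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(code, chunks, k=1):
    """Return the k design-document chunks most similar to the code."""
    query = Counter(tokenize(code))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(query, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


# Made-up design-document chunks and code, for illustration only.
chunks = [
    "The discount service applies a percentage discount to the cart total.",
    "The login page validates the user password and starts a session.",
]
code = "def apply_discount(cart_total, percent):\n    return cart_total * (1 - percent / 100)"

top_chunk = retrieve(code, chunks)[0]
```

The retrieved chunk would then be placed in the prompt alongside the code. A production setup would more likely use dense embeddings, but the ranking-then-prompting shape stays the same.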

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Aggarwal, K., Singh, Y., and Chhabra, J. (2002). An integrated measure of soft- ware maintainability. InAnnual Reliability and Maintainability Symposium. 2002 Proceedings (Cat. No.02CH37318), pages 235–241

work page 2002

[2] [2]

and Devanbu, P

Ahmed, T. and Devanbu, P. (2022). Few-shot training LLMs for project-specific code-summarization

work page 2022

[3] [3]

Aimer, A. (1998). Introduction to Software Documentation

work page 1998

[4] [4]

J., Huang, Y., and Rajan, H

Biswas, S., Islam, M. J., Huang, Y., and Rajan, H. (2019). Boa Meets Python: A Boa Dataset of Data Science Software in Python Language. In2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 577–581

work page 2019

[5] [5]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[6] [6]

Cai, R., Liang, Z., Xu, B., Li, Z., Hao, Y., and Chen, Y. (2020). TAG : Type Auxiliary Guiding for Code Comment Generation

work page 2020

[7] [7]

P., and Chakrabarti, A

Chatterjee, N., Majumdar, S., Das, P. P., and Chakrabarti, A. (2023). ParallelC- Assist: Productivity Accelerator Suite Based on Dynamic Instrumentation.IEEE Access, 11:73599–73612

work page 2023

[8] [8]

P., and Chakrabarti, A

Chatterjee, N., Majumdar, S., Das, P. P., and Chakrabarti, A. (2025). Tool assisted agile approach for legacy application migration.International Journal of System Assurance Engineering and Management, 16(9):3002–3017

work page 2025

[9] [9]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Ch...

work page 2021

[10] [10]

P., Mytkowicz, T., Wang, B., Gao, J., and Duan, N

Cui, H., Wang, C., Huang, J., Inala, J. P., Mytkowicz, T., Wang, B., Gao, J., and Duan, N. (2022). CodeExp: Explanatory Code Document Generation

work page 2022

[11] [11]

A., Christie, A

Dart, S. A., Christie, A. M., and Brown, A. W. (1993). A case study in software maintenance. Technical Report CMU/SEI-93-TR-8, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA

work page 1993

[12] [12]

de Souza, S. C. B., Anquetil, N., and de Oliveira, K. M. (2005). A study of the documentation essential to software maintenance. InProceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information, SIGDOC ’05, page 68–75, New York, NY, USA. Association for Computing Machinery

work page 2005

[13] [13]

and Lapata, M

Dong, L. and Lapata, M. (2016). Language to logical form with neural attention. In Erk, K. and Smith, N. A., editors,Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics

work page 2016

[14] [14]

X., Narayanan, A

Fan, A. X., Narayanan, A. B. L., Hassany, M., and Ke, J. (2024). Evaluating the Quality of Code Comments Generated by Large Language Models for Novice Programmers

work page 2024

[15] [15]

Figl, K., Kirchner, M., Baltes, S., and Felderer, M. (2025). The influence of code comments on the perceived helpfulness of stack overflow posts

work page 2025

[16] [16]

Fluri, B., Wursch, M., and Gall, H. C. (2007). Do Code and Comments Co-Evolve? On the Relation between Source Code and Comment Changes. In14th Working Conference on Reverse Engineering (WCRE 2007), pages 70–79

work page 2007

[17] [17]

Fried, D., Aghajanyan, A., Lin, J., Wang, S., Wallace, E., Shi, F., Zhong, R., tau Yih, W., Zettlemoyer, L., and Lewis, M. (2023). InCoder: A Generative Model for Code Infilling and Synthesis

work page 2023

[18] [18]

Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. (2023). GPTScore: Evaluate as You Desire

work page 2023

[19] [19]

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey

work page 2024

[20] [20]

Hu, X., Li, G., Xia, X., Lo, D., and Jin, Z. (2020). Deep code comment gen- eration with hybrid lexical and syntactical information.Empirical Softw. Engg., 25(3):2179–2217

work page 2020

[21] [21]

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. (2020). CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

work page 2020

[22] [22]

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., tau Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2021). Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks

work page 2021

[23] [23]

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics

work page 2004

[24] [24]

Engineering investment analysis: Metrics guide

LinearB (2024). Engineering investment analysis: Metrics guide. https://linearb. io/metrics-guide/. Analysis of 3,000+ teams

work page 2024

[25] [25]

S., Ye, L., Fabbri, A

Liu, Y., Shi, K., He, K. S., Ye, L., Fabbri, A. R., Liu, P., Radev, D., and Cohan, A. (2024). On Learning to Summarize with Large Language Models as References

work page 2024

[26] [26]

K., Fu, S., and Liu, S

Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, D., Jiang, D., Tang, D., Li, G., Zhou, L., Shou, L., Zhou, L., Tufano, M., Gong, M., Zhou, M., Duan, N., Sundaresan, N., Deng, S. K., Fu, S., and Liu, S. (2021). CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

work page 2021

[27] [27]

P., Clough, P

Majumdar, S., Bansal, A., Das, P. P., Clough, P. D., Datta, K., and Ghosh, S. K. (2022). Automated evaluation of comments to aid software maintenance.Journal of Software: Evolution and Process, 34(7):e2463

work page 2022

[28] [28]

P., and Chakrabarti, P

Majumdar, S., Deshpande, A., Das, P. P., and Chakrabarti, P. P. (2025). Com- prehending c codes with llms: Effective comment generation through retrieval and reasoning.Pattern Recognition Letters

work page 2025

[29] [29]

Hello GPT-4o

OpenAI (2024). Hello GPT-4o. Large Language Model. https://openai.com/index/ hello-gpt-4o/

work page 2024

[30] [30]

Introducing OpenAI o3 and o4-mini

OpenAI (2025). Introducing OpenAI o3 and o4-mini. Large Language Model. https://openai.com/index/introducing-o3-and-o4-mini/

work page 2025

[31] [31]

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics

work page 2002

[32] [32]

D., and Majumder, P

Paul, S., Majumdar, S., Bandyopadhyay, A., Dave, B., Chattopadhyay, S., Das, P., Clough, P. D., and Majumder, P. (2023). Efficiency of large language models to scale up ground truth: Overview of the irse track at forum for information retrieval

work page 2023

[33] [33]

InProceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, pages 16–18

[34]

Pearson, K. (1900). X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175

[35]

Rani, P., Birrer, M., Panichella, S., Ghafari, M., and Nierstrasz, O. (2021). What do developers discuss about code comments? In 2021 IEEE 21st International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 153–164

[36]

Rani, P., Blasi, A., Stulova, N., Panichella, S., Gorla, A., and Nierstrasz, O. (2023). A decade of code comment quality assessment: A systematic literature review. J. Syst. Softw., 195(C)

[37]

Research, D. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

[38]

Rollbar (2021). The 2021 state of software code report. https://rollbar.com/blog/announcing-the-2021-state-of-software-code-report/. Infographic: https://rollbar.com/wp-content/uploads/2022/06/rollbar-infographic-2021-the-state-of-software-code.pdf

[39]

Salewski, L., Alaniz, S., Rio-Torto, I., Schulz, E., and Akata, Z. (2023). In-Context Impersonation Reveals Large Language Models’ Strengths and Biases

[40]

Shahbazi, R., Sharma, R., and Fard, F. H. (2021). API2Com: On the Improvement of Automatically Generated Code Comments Using API Documentations

[41]

Shmerlin, Y., Hadar, I., Kliger, D., and Makabee, H. (2015). To document or not to document? An exploratory study on developers’ motivation to document code. In Persson, A. and Stirna, J., editors, Advanced Information Systems Engineering Workshops, pages 100–106, Cham. Springer International Publishing

[42]

Sridhara, G., Hill, E., Muppaneni, D., Pollock, L., and Vijay-Shanker, K. (2010). Towards automatically generating summary comments for Java methods. In Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering, ASE ’10, page 43–52, New York, NY, USA. Association for Computing Machinery

[43]

Sun, W., Fang, C., Miao, Y., You, Y., Yuan, M., Chen, Y., Zhang, Q., Guo, A., Chen, X., Liu, Y., and Chen, Z. (2023). Abstract syntax tree for programming language understanding and representation: How far are we?

[44]

Sun, W., Zhang, Y., Zhu, J., Wang, Z., Fang, C., Zhang, Y., Feng, Y., Huang, J., Wang, X., Jin, Z., et al. (2025). Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization. arXiv preprint arXiv:2503.10737

[45]

Team, M. A. (2025). Codestral 25.01. Large Language Model. https://mistral.ai/news/codestral-2501

[46]

Tenny, T. (1988). Program readability: procedures versus comments. IEEE Transactions on Software Engineering, 14(9):1271–1279

[47]

Venkatkrishna, V., Nagabushanam, D. S., Simon, E. I.-O., and Vidoni, M. (2023). DocGen: Generating Detailed Parameter Docstrings in Python

[48]

Xia, X., Bao, L., Lo, D., Xing, Z., Hassan, A. E., and Li, S. (2018). Measuring Program Comprehension: A Large-Scale Field Study with Professionals. IEEE Transactions on Software Engineering, 44(10):951–976

[49]

Xu, Z., Peng, K., Ding, L., Tao, D., and Lu, X. (2024). Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction

[50]

Yang, G., Chen, X., Cao, J., Xu, S., Cui, Z., Yu, C., and Liu, K. (2021). ComFormer: Code Comment Generation via Transformer and Fusion Method-based Hybrid Code Representation

[51]

Yin, P., Deng, B., Chen, E., Vasilescu, B., and Neubig, G. (2018). Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

[52]

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT

[53]

Zhong, V., Xiong, C., and Socher, R. (2017). Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning
