CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Anh Nguyen Hoang; Bach Le; Minh Le-Anh; Nghi D. Q. Bui

arxiv: 2510.24428 · v6 · submitted 2025-10-28 · 💻 cs.SE

Pith reviewed 2026-05-18 03:07 UTC · model grok-4.3

classification 💻 cs.SE

keywords: automated documentation, repository-level documentation, multi-agent processing, codebase analysis, software maintenance, multi-modal synthesis, benchmarking

The pith

CodeWiki generates holistic, architecture-aware documentation for large codebases by combining hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, scoring 68.79% on CodeWikiBench and exceeding the DeepWiki baseline (64.06%) by 4.73 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeWiki, a framework for automatically generating comprehensive documentation for entire code repositories that accounts for interactions across files, modules, and the overall system. Existing automated approaches fall short because they do not capture the semantic dependencies and architectural structures that shape real software. CodeWiki addresses this through three techniques: breaking the codebase into hierarchical layers to retain context, delegating tasks among multiple AI agents recursively for scale, and blending textual explanations with visual diagrams and data flows. The authors also release CodeWikiBench, a benchmark with multi-dimensional rubrics judged by language models, to support consistent evaluation. Experiments across seven languages show CodeWiki reaching 68.79 percent quality, a 4.73 percent gain over the prior DeepWiki system and larger gains for scripting languages.

Core claim

CodeWiki is a unified framework for repository-level documentation generation that employs hierarchical decomposition to preserve architectural context across granularity levels, recursive multi-agent processing with dynamic task delegation to achieve scalability, and multi-modal synthesis to integrate textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. Evaluated on the new CodeWikiBench benchmark, this produces a 68.79 percent quality score with proprietary models, outperforming the closed-source DeepWiki baseline by 4.73 percent, with notably stronger results on high-level scripting languages.

What carries the argument

The CodeWiki framework carries the argument. It unites hierarchical decomposition to keep architectural context, recursive multi-agent processing for scalable task handling, and multi-modal synthesis of text plus visuals to model cross-file and system-level interactions.
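To make the combination of hierarchical decomposition and recursive delegation concrete, here is a minimal sketch. It is not the authors' implementation: the `Module` type, the `document` function, the token `budget`, and the toy repository are all hypothetical, and the "agent" is reduced to a string so the control flow stays visible.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """A node in a hypothetical module tree: a package, file, or component."""
    name: str
    size: int  # rough size in tokens (assumed unit)
    children: list["Module"] = field(default_factory=list)


def document(module: Module, budget: int, depth: int = 0) -> list[str]:
    """Return doc lines for `module`, delegating to sub-calls when it is too large."""
    indent = "  " * depth
    # Leaf case: small enough (or indivisible), so a single agent handles it.
    if module.size <= budget or not module.children:
        return [f"{indent}{module.name}: documented directly"]
    # Recursive case: delegate each child, keeping a parent-level overview
    # so architectural context survives across levels.
    lines = [f"{indent}{module.name}: overview synthesized from {len(module.children)} parts"]
    for child in module.children:
        lines.extend(document(child, budget, depth + 1))
    return lines


# Toy repository, invented for illustration.
repo = Module("repo", 900, [
    Module("core", 600, [Module("core/parser", 300), Module("core/engine", 300)]),
    Module("cli", 100),
])
doc_lines = document(repo, budget=400)
print("\n".join(doc_lines))
```

The point of the sketch is the shape of the recursion: a node is only split when it exceeds the budget, and every split leaves an overview entry at the parent level.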

If this is right

Long-term software maintenance improves when documentation captures cross-module dependencies rather than isolated functions.
Collaboration benefits from system-level views that reveal how components interact.
Generation scales to evolving repositories through dynamic agent delegation without manual intervention.
Standardized benchmarking via CodeWikiBench enables direct comparison of future documentation systems.
Gains are especially pronounced for high-level scripting languages, suggesting language-specific strengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could reduce onboarding time for new developers by supplying instant architecture overviews instead of requiring manual exploration.
Embedding the framework in version-control workflows might keep documentation synchronized as code changes.
The same hierarchical and multi-agent pattern could extend to generating explanations for data pipelines or hardware designs.
Validating the automated scores against human developer feedback on usefulness would test whether the quality metric predicts practical value.

Load-bearing premise

The LLM-based assessment protocols and multi-dimensional rubrics in CodeWikiBench can accurately measure holistic documentation quality without introducing bias or overlooking key architectural aspects.

What would settle it

A side-by-side human expert evaluation of documentation generated by CodeWiki versus baselines on the same codebases, checking whether the 4.73 percent quality advantage holds under direct review or real maintenance-task performance.

Original abstract

Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06%) by 4.73%, with particularly strong improvements on high-level scripting languages (+10.47%). We open-source CodeWiki to foster future research and community adoption.
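The abstract's "4.73%" is the absolute difference between the two reported scores, that is, percentage points. A short check using only the two figures quoted above shows both the absolute gap and what it amounts to relative to the baseline; the variable names are illustrative.

```python
codewiki = 68.79  # quality score reported in the abstract (%)
deepwiki = 64.06  # baseline score reported in the abstract (%)

# Absolute gap in percentage points, as quoted in the abstract.
absolute_gain = round(codewiki - deepwiki, 2)
# The same gap expressed as a fraction of the baseline score.
relative_gain = round(100 * (codewiki - deepwiki) / deepwiki, 2)

print(f"absolute gain: {absolute_gain} points")
print(f"relative gain: {relative_gain}% of the baseline score")
```

Reading the figure as percentage points matters when comparing it with results from other papers that report relative improvements.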

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CodeWiki combines hierarchical decomposition with recursive multi-agent generation and multi-modal outputs to tackle repo-scale documentation, delivering a modest 4.73% edge over DeepWiki on its new benchmark, but the gains rest on unvalidated LLM judges.

The main point of this paper is that CodeWiki assembles hierarchical decomposition, recursive multi-agent task delegation, and multi-modal synthesis into one framework for producing architecture-aware docs across seven languages, and it reports a 68.79% quality score that beats the DeepWiki baseline by 4.73% with bigger lifts on scripting languages. The authors also release CodeWikiBench and open-source the code, which helps others test the ideas directly. That combination is new enough as an integrated package even if the pieces draw from prior multi-agent and code-AI work. The practical focus on cross-module and system-level interactions is useful for real maintenance tasks where function-level docs fall short. The benchmark tries to measure more dimensions than simple accuracy, which is a step forward from narrower evaluations.

On the downside, the headline numbers come entirely from LLM-based scoring protocols and rubrics with no reported human correlation, inter-rater checks, or ablation against actual developer tasks. A 4.73% difference is small to begin with, so without evidence that the judge tracks genuine quality or avoids favoring the proposed method, it is difficult to treat the improvement as solid. The paper does not appear to include error bars or sensitivity analysis on the rubric choices either.

This work is aimed at researchers and tool builders working on AI for software engineering and large-scale code comprehension. Readers who need concrete baselines or an open framework to extend would find value in the architecture and the released benchmark. It deserves a serious referee because the problem is concrete, the setup is reproducible, and the evaluation gap is fixable rather than fatal. I would send it out for review so the authors can add human validation or stronger controls before the claims are taken as settled.

Referee Report

1 major / 1 minor

Summary. The paper presents CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. It introduces three innovations: hierarchical decomposition to preserve architectural context, recursive multi-agent processing with dynamic task delegation, and multi-modal synthesis integrating text with visual artifacts like architecture diagrams. To support evaluation, the authors propose CodeWikiBench featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results claim CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the DeepWiki baseline (64.06%) by 4.73%, with stronger gains (+10.47%) on high-level scripting languages. The work is open-sourced.

Significance. If the central empirical claims hold under validated evaluation, the framework and benchmark could meaningfully advance automated documentation for large, evolving codebases by better capturing cross-module and system-level interactions. The open-sourcing and multi-language coverage are positive for community adoption and reproducibility. However, the significance is tempered by the unvalidated nature of the LLM-as-judge protocol, which is load-bearing for the reported improvements.

major comments (1)

[Abstract / Experimental Results] Abstract and Experimental Results section: The headline 68.79% vs. 64.06% quality scores and the +10.47% gain on scripting languages are produced entirely by the LLM-based assessment protocols and multi-dimensional rubrics of CodeWikiBench. No human correlation study, inter-annotator agreement figures, or ablation against actual maintenance tasks are reported, leaving the central claim that CodeWiki produces superior holistic documentation dependent on an unverified proxy that may embed judge-model biases.
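One way to run the human correlation study the comment asks for is to compare judge scores with human ratings on the same documents using a rank correlation. The sketch below implements Spearman's rho in plain Python. The helper names `ranks` and `spearman` and every number in the example are invented for illustration; nothing here comes from the paper.

```python
import statistics


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks (inputs must not be constant)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# MADE-UP data: LLM-judge scores (%) and human ratings (1-5) for six documents.
llm_scores = [62, 71, 55, 80, 68, 74]
human_ratings = [3, 4, 2, 5, 4, 3]
rho = spearman(llm_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A high rho on a held-out sample would support using the judge as a proxy; a low one would suggest the rubric scores measure something other than what reviewers value.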

minor comments (1)

[Abstract] The abstract states concrete percentage improvements without accompanying error bars or statistical significance tests; adding these would strengthen the presentation of the 4.73% gain.
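Error bars of the kind requested could be produced in several ways; one common option is a paired percentile bootstrap over per-repository scores. The sketch below assumes both systems were scored on the same repositories. The function name `paired_bootstrap_ci` is hypothetical and all scores are made up for illustration, not taken from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample repositories with replacement, keeping pairs intact.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), low, high


# MADE-UP per-repository quality scores (%) for two systems.
system_a = [70.0, 66.5, 72.0, 61.0, 69.5, 74.0, 65.0, 68.0]
system_b = [66.0, 67.0, 65.5, 60.0, 63.0, 70.5, 66.0, 62.5]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
print(f"mean difference {mean_diff:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

If the resulting interval excludes zero, the gap is unlikely to be an artifact of which repositories happened to be sampled; it says nothing about whether the judge itself is biased.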

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The concern regarding validation of the LLM-as-judge protocol in CodeWikiBench is well-taken, and we address it directly below while outlining targeted revisions.

Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The headline 68.79% vs. 64.06% quality scores and the +10.47% gain on scripting languages are produced entirely by the LLM-based assessment protocols and multi-dimensional rubrics of CodeWikiBench. No human correlation study, inter-annotator agreement figures, or ablation against actual maintenance tasks are reported, leaving the central claim that CodeWiki produces superior holistic documentation dependent on an unverified proxy that may embed judge-model biases.

Authors: We appreciate the referee highlighting that our primary results rest on the LLM-based evaluation in CodeWikiBench. The multi-dimensional rubrics were explicitly constructed to assess documentation along axes of completeness, accuracy, architectural coherence, and cross-module coverage, with the intent of reducing reliance on any single subjective judgment. Nevertheless, we acknowledge that the current manuscript does not include a human correlation study, inter-annotator agreement statistics, or downstream ablation on maintenance tasks. This is a genuine limitation of the reported evaluation. In the revised manuscript we will (i) add a dedicated paragraph in the Experimental Results section discussing potential judge-model biases and the rubric design choices made to promote consistency, (ii) report score variance across two additional judge models to demonstrate robustness, and (iii) explicitly state the scope of CodeWikiBench as an intrinsic quality benchmark rather than a proxy for end-to-end maintenance outcomes. These changes will make the evidential basis of our claims more transparent without overstating the current validation.

revision: partial
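The robustness check promised in item (ii) could look roughly like the sketch below, which averages rubric axes per judge and then reports the spread across judges. The four axis names come from the rebuttal; the equal weighting, the judge names, the 0 to 1 scale, and every score are assumptions made for illustration.

```python
import statistics

# Rubric axes named in the rebuttal; equal weighting is an assumption.
AXES = ("completeness", "accuracy", "architectural_coherence", "cross_module_coverage")


def overall(scores):
    """Unweighted mean over the rubric axes (the weighting scheme is assumed)."""
    return statistics.fmean([scores[axis] for axis in AXES])


# MADE-UP scores from three hypothetical judge models for one generated document.
judge_scores = {
    "judge_a": {"completeness": 0.70, "accuracy": 0.80,
                "architectural_coherence": 0.60, "cross_module_coverage": 0.70},
    "judge_b": {"completeness": 0.60, "accuracy": 0.70,
                "architectural_coherence": 0.60, "cross_module_coverage": 0.70},
    "judge_c": {"completeness": 0.80, "accuracy": 0.80,
                "architectural_coherence": 0.70, "cross_module_coverage": 0.70},
}

per_judge = {name: overall(scores) for name, scores in judge_scores.items()}
mean_score = statistics.fmean(list(per_judge.values()))
spread = statistics.stdev(list(per_judge.values()))  # sample standard deviation

print(f"mean across judges: {mean_score:.3f}, std dev: {spread:.3f}")
```

A spread that is small relative to the gap between systems would support the robustness claim; agreement among judges still would not show that they agree with human reviewers.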

Circularity Check

0 steps flagged

No significant circularity: empirical comparison to external baseline on newly introduced benchmark

The paper introduces CodeWiki as a framework with hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, then defines CodeWikiBench with multi-dimensional rubrics and LLM-based assessment to evaluate it. The central result (68.79% for CodeWiki vs. 64.06% for DeepWiki) is obtained by applying the same benchmark protocol to both the proposed system and an independent closed-source baseline. No equations, fitted parameters, or self-citations are shown that would make the reported superiority equivalent to the inputs by construction. The evaluation protocol is external to the method itself and the comparison provides an independent check, satisfying the criteria for a self-contained empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework builds on existing LLM capabilities and standard software engineering practices without introducing new mathematical axioms, free parameters fitted to data, or invented physical entities; the main additions are engineering choices in the multi-agent workflow.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "hierarchical decomposition that preserves architectural context across multiple levels of granularity, recursive multi-agent processing with dynamic task delegation"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "multi-dimensional rubrics and LLM-based assessment protocols"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Documentation-Guided Agentic Codebase Migration from C to Rust
cs.SE · 2026-05 · unverdicted · novelty 7.0
RustPrint uses documentation as a migration blueprint for agentic C-to-Rust translation, achieving full compilation and higher feature preservation than baselines on eight real-world repositories from 11K to 84K LoC.

Documentation-Guided Agentic Codebase Migration from C to Rust
cs.SE · 2026-05 · unverdicted · novelty 7.0
RustPrint is a documentation-guided agentic system that migrates entire C repositories to Rust by using architecture docs as blueprints, achieving full compilability and 93-95% feature/test preservation on eight 11K-8...

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
cs.SE · 2026-05 · unverdicted · novelty 7.0
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
cs.SE · 2026-04 · unverdicted · novelty 7.0
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
cs.SE · 2025-12 · unverdicted · novelty 7.0
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 4 Pith papers · 4 internal anchors

[1]

Few-shot training llms for project-specific code-summarization

Toufique Ahmed and Premkumar Devanbu. Few-shot training llms for project-specific code-summarization. In Proceedings of the 37th IEEE/ACMInternationalConference on Automated SoftwareEngineering, ASE ’22, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394758. doi: 10.1145/3551349.3559555. URLhttps://doi.org/10.1145/3551349.3559555

work page doi:10.1145/3551349.3559555 2023
[2]

Program synthesis with large language models, 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021

work page 2021
[3]

Codetf: One-stop transformer library for state-of-the-art code llm.arXiv preprint arXiv:2306.00029, 2023

Nghi DQ Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, and Steven CH Hoi. Codetf: One-stop transformer library for state-of-the-art code llm.arXiv preprint arXiv:2306.00029, 2023

work page arXiv 2023
[4]

MultiPL-E:

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation.IEEE Transactions on Software Engineering, 49(7):3675–3691,...

work page doi:10.1109/tse.2023.3267446 2023
[5]

An empirical analysis of the impact of software development problem factors on software maintainability.J

Jie-Cherng Chen and Sun-Jen Huang. An empirical analysis of the impact of software development problem factors on software maintainability.J. Syst. Softw., 82(6):981–992, June 2009. ISSN 0164-1212. doi: 10.1016/j.jss. 2008.12.036. URLhttps://doi.org/10.1016/j.jss.2008.12.036

work page doi:10.1016/j.jss 2009
[6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page 2021
[7]

Anti-saturation coordination control of permanent magnet synchronous wind power system.IEEE Access, 11:33428–33441, 2023

Yunseok Choi, Cheolwon Na, Hyojun Kim, and Jee-Hyong Lee. Readsum: Retrieval-augmented adaptive transformer for source code summarization.IEEE Access, 11:51155–51165, 2023. doi: 10.1109/ACCESS.2023. 3271992

work page doi:10.1109/access.2023 2023
[8]

On the effectiveness of llm-as-a-judge for code generation and summarization, 2025

Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, and Gabriele Bavota. On the effectiveness of llm-as-a-judge for code generation and summarization, 2025. URL https: //arxiv.org/abs/2507.16587

work page arXiv 2025
[9]

de Souza, Nicolas Anquetil, and Káthia M

Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia M. de Oliveira. A study of the documentation essential to software maintenance. InProceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for PervasiveInformation, SIGDOC ’05, page 68–75, New York, NY, USA, 2005. Association for Computing Machiner...

work page doi:10.1145/1085313.1085331 2005
[10]

Deepwiki, 2025

DeepWiki. Deepwiki, 2025. URLhttps://deepwiki.com/

work page 2025
[11]

Out of the bleu: How should we assess quality of the code generation models?J

Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. Out of the bleu: How should we assess quality of the code generation models?J. Syst. Softw., 203(C), September 2023. ISSN 0164-1212. doi: 10.1016/j.jss.2023.111741. URLhttps://doi.org/10.1016/j.jss.2023.111741

work page doi:10.1016/j.jss.2023.111741 2023
[12]

CodeBERT: A pre-trained model for programming and natural languages

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online, November 2...

work page doi:10.18653/v1/2020.findings-emnlp.139 2020
[13]

Graph- codebert: Pre-training code representations with data flow

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. Graph- codebert: Pre-training code representations with data flow. InICLR, 2021

work page 2021
[14]

Analyzing the performance of large language models on code summa- rization

Rajarshi Haldar and Julia Hockenmaier. Analyzing the performance of large language models on code summa- rization. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen 13 Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-...

work page 2024
[15]

Codesearchnet challenge: Evaluating the state of semantic code search

Husain Hamel, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. InProceedings of the 2019 Symposium on Foundations of Software Engineering (FSE), pages 974–985. ACM, 2019

work page 2019
[16]

From code to courtroom: Llms as the new software judges, 2025

Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo. From code to courtroom: Llms as the new software judges, 2025. URLhttps://arxiv.org/abs/2503.02246

work page arXiv 2025
[17]

Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead.ACM Trans.Softw

Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead.ACM Trans.Softw. Eng. Methodol., 34(5), May 2025. ISSN 1049-331X. doi: 10.1145/3712003. URLhttps://doi.org/10.1145/3712003

work page doi:10.1145/3712003 2025
[18]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Xiaohan Hong, Jiaxi Zhang, Wenzhong Tang, Yizhou Jiang, Quan Liu, and Yidong Yang. Metagpt: Meta programming for multi-agent collaborative framework.arXiv preprint arXiv:2308.00352, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Large language models for software engineering: A systematic literature review, 2023

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review, 2023

work page 2023
[20]

Automatic code documentation generation using gpt-3

Junaed Younus Khan and Gias Uddin. Automatic code documentation generation using gpt-3. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394758. doi: 10.1145/3551349.3559548. URL https://doi.org/10.1145/3551349.3559548

work page doi:10.1145/3551349.3559548 2023
[21]

A neural model for generating natural language summaries of program subroutines

Alexander LeClair, Siyuan Jiang, and Collin McMillan. A neural model for generating natural language summaries of program subroutines. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 795–806. IEEE, 2019

work page 2019
[22]

Do pretrained language models indeed understand software engineering tasks? IEEE Transactions on Software Engineering, 49(10):4639–4655, 2023

Yao Li, Tao Zhang, Xiapu Luo, Haipeng Cai, Sen Fang, and Dawei Yuan. Do pretrained language models indeed understand software engineering tasks? IEEE Transactions on Software Engineering, 49(10):4639–4655, 2023. doi: 10.1109/TSE.2023.3308952

work page doi:10.1109/tse.2023.3308952 2023
[23]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

work page 2004
[24]

In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023

Jiahao Liu, Jun Zeng, Xiang Wang, and Zhenkai Liang. Learning graph-based code representations for source-level functional similarity detection. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 345–357, 2023. doi: 10.1109/ICSE48619.2023.00040

work page doi:10.1109/icse48619.2023.00040 2023
[25]

G-eval: Nlg evaluation using gpt-4 with better human alignment.arXiv preprint arXiv:2311.08788, 2023

Yang Liu, Yao Fu, Yujie Xie, Xinyi Chen, Bo Pang, Chenyan Qian, Teng Ma, and Dragomir Radev. G-eval: Nlg evaluation using gpt-4 with better human alignment.arXiv preprint arXiv:2311.08788, 2023

work page arXiv 2023
[26]

ProConSuL: Project context for code summarization with LLMs

Vadim Lomshakov, Andrey Podivilov, Sergey Savin, Oleg Baryshnikov, Alena Lisevych, and Sergey Nikolenko. ProConSuL: Project context for code summarization with LLMs. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 8...

work page doi:10.18653/v1/2024.emnlp-industry.65 2024
[27]

RepoAgent: An LLM-powered open-source framework for repository-level code documentation generation

Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. RepoAgent: An LLM-powered open-source framework for repository-level code documentation generation. In Delia Irazu Hernandez Farias, Tom Hope, and Manling Li, editors,Proceedings of the 2024 Conferen...

work page doi:10.18653/v1/2024.emnlp-demo.46 2024
[28]

Knowledgegraphbasedrepository-levelcodegeneration

Vladimir Makharev and Vladimir Ivanov. Code summarization beyond function level. In2025 IEEE 1st Conference on Large Language Models for Code (LLM4Code), pages 153–160, 05 2025. doi: 10.1109/LLM4Code66737.2025. 00024

work page doi:10.1109/llm4code66737.2025 2025
[29]

Evaluating code summarization techniques: A new metric and an empirical characterization

Ernesto Mastropaolo, Georgios Gousios, Gabriele Bavota, Rocco Oliveto, and Barbara Russo. Evaluating code summarization techniques: A new metric and an empirical characterization. InProceedings of the 46th International Conference on Software Engineering (ICSE), 2024. 14

work page 2024
[30]

McBurney, Siyuan Jiang, Marouane Kessentini, Nicholas A

Paul W. McBurney, Siyuan Jiang, Marouane Kessentini, Nicholas A. Kraft, Ameer Armaly, Mohamed Wiem Mkaouer, and Collin McMillan. Towards prioritizing documentation effort.IEEE Trans. Softw. Eng., 44(9): 897–913, September 2018. ISSN 0098-5589. doi: 10.1109/TSE.2017.2716950. URLhttps://doi.org/10.1109/ TSE.2017.2716950

work page doi:10.1109/tse.2017.2716950 2018
[31]

Robillard

Mathieu Nassif and Martin P. Robillard. Non-linear software documentation with interactive code examples. ACM Trans. Softw. Eng. Methodol., 34(2), January 2025. ISSN 1049-331X. doi: 10.1145/3702976. URL https://doi.org/10.1145/3702976

work page doi:10.1145/3702976 2025
[32]

Codemmlu: A multi-task benchmark for assessing code understanding & reasoning capabilities of codellms

Dung Manh Nguyen, Thang Chau Phan, Nam Le Hai, Tien-Thong Doan, Nam V Nguyen, Quang Pham, and Nghi DQ Bui. Codemmlu: A multi-task benchmark for assessing code understanding & reasoning capabilities of codellms. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[33]

Agilecoder: Dynamic collaborative agents for software development based on agile methodology

Minh Huynh Nguyen, Thang Phan Chau, Phong X Nguyen, and Nghi DQ Bui. Agilecoder: Dynamic collaborative agents for software development based on agile methodology. In2025 IEEE/ACMSecond InternationalConference on AI Foundation Models and Software Engineering (Forge), pages 156–167. IEEE, 2025

work page 2025
[34]

Deep learning meets software engineering: A survey on pre-trained models of source code, 2022

Changan Niu, Chuanyi Li, Bin Luo, and Vincent Ng. Deep learning meets software engineering: A survey on pre-trained models of source code, 2022

work page 2022
[35]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

work page 2002
[36]

arXiv preprint arXiv:2409.16299

Huy Nhat Phan, Tien N Nguyen, Phong X Nguyen, and Nghi DQ Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale.arXiv preprint arXiv:2409.16299, 2024

work page arXiv 2024
[37]

Documint: Docstring generation for python using small language models.arXiv preprint arXiv:2405.10243), 2024

Bibek Poudel, Adam Cook, Sekou Traore, and Shelah Ameli. Documint: Docstring generation for python using small language models.arXiv preprint arXiv:2405.10243), 2024

work page arXiv 2024
[38]

ChatDev: Communicative Agents for Software Development

Yuzhang Qian, Zian Zhang, Liang Pan, Peng Wang, Shouyi Liu, Wayne Xin Zhao, and Ji-Rong Wen. Chatdev: Revolutionizing software development with ai-collaborative agents.arXiv preprint arXiv:2307.07924, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

A review on source code documentation.ACM Trans

Sawan Rai, Ramesh Chandra Belwal, and Atul Gupta. A review on source code documentation.ACM Trans. Intell. Syst. Technol., 13(5), June 2022. ISSN 2157-6904. doi: 10.1145/3519312. URLhttps://doi.org/10.1145/ 3519312

work page doi:10.1145/3519312 2022
[40]

Boosting coverage-based fault localization via graph-based representation learning,

Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. Reassessing automatic evaluation metrics for code sum- marization tasks. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundationsof Software Engineering, ESEC/FSE 2021, page 1105–1116, New York, NY, USA, 2021. Association for Computing Machi...

work page doi:10.1145/3468264.3468588 2021
[41]

Rahul Roy, Saikat Chakraborty, Baishakhi Ray, and Miryung Kim. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 1344–1356, 2021.
[42]

Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang, and Hadi Hemmati. Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks, 2023.
[43]

Stack Overflow. 2025 stack overflow developer survey: Ai, 2025. URL https://survey.stackoverflow.co/2025/ai. Based on 49,000+ developer responses.
[44]

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. Paperbench: Evaluating ai’s ability to replicate ai research, 2025. URL https://arxiv.org/abs/2504.01848.
[45]

Weisong Sun, Chunrong Fang, Yudu You, Yun Miao, Yi Liu, Yuekang Li, Gelei Deng, Shenghan Huang, Yuchen Chen, Quanjun Zhang, et al. Automatic code summarization via chatgpt: How far are we? arXiv preprint arXiv:2305.12865, 2023.
[46]

Christoph Treude, Justin Middleton, and Thushari Atapattu. Beyond accuracy: assessing software documentation quality. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, pages 1509–1512, New York, NY, USA, 2020. Association for Computing Machinery.... doi: 10.1145/3368089.3417045.
[47]

Julian von der Mosel, Alexander Trautsch, and Steffen Herbold. On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Transactions on Software Engineering, 49(4):1487–1507, 2023. doi: 10.1109/TSE.2022.3178469.
[48]

Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering. Proc. ACM Softw. Eng., 2(ISSTA), June. doi: 10.1145/3728963. URL https://doi.org/10.1145/3728963.
[50]

Yue Wang, Shuo Ren, Daya Lu, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP, 2021.
[51]

Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi. Codet5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1069–1088, 2023.
[52]

Ziniu Wu, Cheng Liu, Jindong Zhang, Xinyun Li, Yewen Wang, Jimmy Xin, Lianmin Zhang, Eric Xing, Yuxin Lu, and Percy Liang. Autogen: Enabling next-generation multi-agent communication with language models. arXiv preprint arXiv:2309.07864, 2023.
[53]

Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, and Hari Sundaram. Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation, 2023.
[54]

Dayu Yang, Antoine Simoulin, Xin Qian, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, and Grey Yang. DocAgent: A multi-agent system for automated code documentation generation. In Pushkar Mishra, Smaranda Muresan, and Tao Yu, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 460–4... 2025. doi: 10.18653/v1/2025.acl-demo.44.
[55]

Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen. A survey on large language models for software engineering. ArXiv, abs/2312.15223, 2023. URL https://api.semanticscholar.org/CorpusID:266551742.
[56]

Xiaoqing Zhang, Zhirui Wang, Lichao Yang, Wei Zhang, and Yong Zhang. Mapcoder: Map-reduce-style code generation with multi-agent collaboration. arXiv preprint arXiv:2307.15808, 2023.
[57]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Wu Zhanghao, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[58]

Zibin Zheng, Kaiwen Ning, Jiachi Chen, Yanlin Wang, Wenqing Chen, Lianghong Guo, and Weicheng Wang. Towards an understanding of large language models in software engineering tasks, 2023.
[59]

Junji Zhi, Vahid Garousi-Yusifoğlu, Bo Sun, Golara Garousi, Shawn Shahnewaz, and Guenther Ruhe. Cost, benefits and quality of software development documentation. J. Syst. Softw., 99(C):175–198, January 2015. ISSN 0164-1212. doi: 10.1016/j.jss.2014.09.042. URL https://doi.org/10.1016/j.jss.2014.09.042.

[1]

Toufique Ahmed and Premkumar Devanbu. Few-shot training llms for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394758. doi: 10.1145/3551349.3559555. URL https://doi.org/10.1145/3551349.3559555.

[2]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021.

[3]

Nghi DQ Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, and Steven CH Hoi. Codetf: One-stop transformer library for state-of-the-art code llm. arXiv preprint arXiv:2306.00029, 2023.

[4]

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49(7):3675–3691,... 2023. doi: 10.1109/tse.2023.3267446.

[5]

Jie-Cherng Chen and Sun-Jen Huang. An empirical analysis of the impact of software development problem factors on software maintainability. J. Syst. Softw., 82(6):981–992, June 2009. ISSN 0164-1212. doi: 10.1016/j.jss.2008.12.036. URL https://doi.org/10.1016/j.jss.2008.12.036.

[6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... 2021.

[7]

Yunseok Choi, Cheolwon Na, Hyojun Kim, and Jee-Hyong Lee. Readsum: Retrieval-augmented adaptive transformer for source code summarization. IEEE Access, 11:51155–51165, 2023. doi: 10.1109/ACCESS.2023.3271992.

[8]

Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, and Gabriele Bavota. On the effectiveness of llm-as-a-judge for code generation and summarization, 2025. URL https://arxiv.org/abs/2507.16587.

[9]

Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia M. de Oliveira. A study of the documentation essential to software maintenance. In Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information, SIGDOC ’05, pages 68–75, New York, NY, USA, 2005. Association for Computing Machiner... doi: 10.1145/1085313.1085331.

[10]

DeepWiki. Deepwiki, 2025. URL https://deepwiki.com/

[11]

Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. Out of the bleu: How should we assess quality of the code generation models? J. Syst. Softw., 203(C), September 2023. ISSN 0164-1212. doi: 10.1016/j.jss.2023.111741. URL https://doi.org/10.1016/j.jss.2023.111741.

[12]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online, November 2... doi: 10.18653/v1/2020.findings-emnlp.139.

[13]

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. Graphcodebert: Pre-training code representations with data flow. In ICLR, 2021.

[14]

Rajarshi Haldar and Julia Hockenmaier. Analyzing the performance of large language models on code summarization. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-...

[15]

Husain Hamel, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. In Proceedings of the 2019 Symposium on Foundations of Software Engineering (FSE), pages 974–985. ACM, 2019.

[16]

Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo. From code to courtroom: Llms as the new software judges, 2025. URL https://arxiv.org/abs/2503.02246.

[17]

Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Trans. Softw. Eng. Methodol., 34(5), May 2025. ISSN 1049-331X. doi: 10.1145/3712003. URL https://doi.org/10.1145/3712003.

[18]

Xiaohan Hong, Jiaxi Zhang, Wenzhong Tang, Yizhou Jiang, Quan Liu, and Yidong Yang. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.

[19]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review, 2023.

[20]

Junaed Younus Khan and Gias Uddin. Automatic code documentation generation using gpt-3. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394758. doi: 10.1145/3551349.3559548. URL https://doi.org/10.1145/3551349.3559548.

[21]

Alexander LeClair, Siyuan Jiang, and Collin McMillan. A neural model for generating natural language summaries of program subroutines. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 795–806. IEEE, 2019.

[22]

Yao Li, Tao Zhang, Xiapu Luo, Haipeng Cai, Sen Fang, and Dawei Yuan. Do pretrained language models indeed understand software engineering tasks? IEEE Transactions on Software Engineering, 49(10):4639–4655, 2023. doi: 10.1109/TSE.2023.3308952.

[23]

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.

[24]

Jiahao Liu, Jun Zeng, Xiang Wang, and Zhenkai Liang. Learning graph-based code representations for source-level functional similarity detection. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 345–357, 2023. doi: 10.1109/ICSE48619.2023.00040.

[25]

Yang Liu, Yao Fu, Yujie Xie, Xinyi Chen, Bo Pang, Chenyan Qian, Teng Ma, and Dragomir Radev. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2311.08788, 2023.

[26]

Vadim Lomshakov, Andrey Podivilov, Sergey Savin, Oleg Baryshnikov, Alena Lisevych, and Sergey Nikolenko. ProConSuL: Project context for code summarization with LLMs. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 8... doi: 10.18653/v1/2024.emnlp-industry.65.

[27]

Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. RepoAgent: An LLM-powered open-source framework for repository-level code documentation generation. In Delia Irazu Hernandez Farias, Tom Hope, and Manling Li, editors, Proceedings of the 2024 Conferen... doi: 10.18653/v1/2024.emnlp-demo.46.

[28]

Vladimir Makharev and Vladimir Ivanov. Code summarization beyond function level. In 2025 IEEE 1st Conference on Large Language Models for Code (LLM4Code), pages 153–160, 05 2025. doi: 10.1109/LLM4Code66737.2025.00024.

[29]

Ernesto Mastropaolo, Georgios Gousios, Gabriele Bavota, Rocco Oliveto, and Barbara Russo. Evaluating code summarization techniques: A new metric and an empirical characterization. In Proceedings of the 46th International Conference on Software Engineering (ICSE), 2024.

[30]

Paul W. McBurney, Siyuan Jiang, Marouane Kessentini, Nicholas A. Kraft, Ameer Armaly, Mohamed Wiem Mkaouer, and Collin McMillan. Towards prioritizing documentation effort. IEEE Trans. Softw. Eng., 44(9):897–913, September 2018. ISSN 0098-5589. doi: 10.1109/TSE.2017.2716950. URL https://doi.org/10.1109/TSE.2017.2716950.

[31]

Mathieu Nassif and Martin P. Robillard. Non-linear software documentation with interactive code examples. ACM Trans. Softw. Eng. Methodol., 34(2), January 2025. ISSN 1049-331X. doi: 10.1145/3702976. URL https://doi.org/10.1145/3702976.

[32]

Dung Manh Nguyen, Thang Chau Phan, Nam Le Hai, Tien-Thong Doan, Nam V Nguyen, Quang Pham, and Nghi DQ Bui. Codemmlu: A multi-task benchmark for assessing code understanding & reasoning capabilities of codellms. In The Thirteenth International Conference on Learning Representations, 2025.

[33]

Minh Huynh Nguyen, Thang Phan Chau, Phong X Nguyen, and Nghi DQ Bui. Agilecoder: Dynamic collaborative agents for software development based on agile methodology. In 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge), pages 156–167. IEEE, 2025.

[34]

Changan Niu, Chuanyi Li, Bin Luo, and Vincent Ng. Deep learning meets software engineering: A survey on pre-trained models of source code, 2022.
