
arxiv: 2510.24428 · v6 · submitted 2025-10-28 · 💻 cs.SE

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Pith reviewed 2026-05-18 03:07 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated documentation · repository-level documentation · multi-agent processing · codebase analysis · software maintenance · multi-modal synthesis · benchmarking

The pith

CodeWiki generates holistic, architecture-aware documentation for large codebases by combining hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, scoring 68.79% on CodeWikiBench and exceeding the DeepWiki baseline (64.06%) by 4.73 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeWiki, a framework for automatically generating comprehensive documentation for entire code repositories that accounts for interactions across files, modules, and the overall system. Existing automated approaches fall short because they do not capture the semantic dependencies and architectural structures that shape real software. CodeWiki addresses this through three techniques: decomposing the codebase into hierarchical layers to retain context, recursively delegating tasks among multiple AI agents for scale, and combining textual explanations with visual artifacts such as architecture diagrams and data-flow representations. The authors also release CodeWikiBench, a benchmark with multi-dimensional rubrics judged by language models, to support consistent evaluation. Experiments across seven programming languages show CodeWiki reaching a 68.79 percent quality score with proprietary models, 4.73 percentage points above the closed-source DeepWiki baseline (64.06 percent), with a larger reported gain (+10.47%) on high-level scripting languages.
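To make the headline comparison concrete, the short sketch below separates the absolute gap (in percentage points) from the relative improvement over the baseline. Only the two scores come from the abstract; the variable names are illustrative, and the relative figure is derived here rather than reported by the authors.

```python
# Scores as reported in the abstract (percent).
codewiki = 68.79
deepwiki = 64.06

# Absolute gap in percentage points, and the same gap relative to the baseline.
absolute_gain = codewiki - deepwiki
relative_gain = 100 * absolute_gain / deepwiki  # derived here, not stated in the paper

print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.2f}%")
```

The distinction matters when comparing against other papers, since a gain quoted as a bare percentage can be read either way.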

Core claim

CodeWiki is a unified framework for repository-level documentation generation. It employs hierarchical decomposition to preserve architectural context across granularity levels, recursive multi-agent processing with dynamic task delegation to achieve scalability, and multi-modal synthesis to integrate textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. Evaluated on the new CodeWikiBench benchmark, the framework achieves a 68.79 percent quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06 percent) by 4.73 percentage points, with notably stronger results on high-level scripting languages.
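As a rough illustration of how hierarchical decomposition and recursive delegation could fit together, the sketch below walks a module tree, hands any subtree that exceeds a size budget to sub-agents, and then synthesizes a parent overview from the child documents. This is not the authors' implementation: `Module`, `MAX_TOKENS_PER_AGENT`, and `document` are hypothetical names, the token counts are made up, and the LLM call is replaced by a string so the example runs offline.

```python
from dataclasses import dataclass, field

# Hypothetical budget: the most tokens one agent documents in a single pass.
MAX_TOKENS_PER_AGENT = 1000


@dataclass
class Module:
    """A node in the repository hierarchy (illustrative, not the paper's data model)."""
    name: str
    tokens: int = 0  # code owned directly by this node
    children: list["Module"] = field(default_factory=list)

    def total_tokens(self) -> int:
        return self.tokens + sum(child.total_tokens() for child in self.children)


def document(module: Module, depth: int = 0) -> list[str]:
    """Return an indented outline of the documents that would be generated."""
    indent = "  " * depth
    if not module.children or module.total_tokens() <= MAX_TOKENS_PER_AGENT:
        # Base case: a leaf, or a subtree small enough for one agent.
        # A real system would call an LLM here.
        return [f"{indent}{module.name}: documented directly"]
    # Recursive case: delegate each child, then synthesize a parent overview.
    lines = []
    for child in module.children:
        lines.extend(document(child, depth + 1))
    lines.append(f"{indent}{module.name}: overview from {len(module.children)} child docs")
    return lines


# Made-up repository layout for demonstration only.
repo = Module("repo", children=[
    Module("core", children=[Module("core.parser", 800), Module("core.graph", 700)]),
    Module("cli", 300),
])
outline = document(repo)
print("\n".join(outline))
```

Children are documented before their parent, so each overview can draw on finished child documents; that ordering is one plausible way to keep architectural context while staying inside a per-agent budget.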

What carries the argument

The CodeWiki framework carries the argument. It unites hierarchical decomposition to keep architectural context, recursive multi-agent processing for scalable task handling, and multi-modal synthesis of text plus visuals, so that the generated documentation models cross-file and system-level interactions.

If this is right

  • Long-term software maintenance improves when documentation captures cross-module dependencies rather than isolated functions.
  • Collaboration benefits from system-level views that reveal how components interact.
  • Generation scales to evolving repositories through dynamic agent delegation without manual intervention.
  • Standardized benchmarking via CodeWikiBench enables direct comparison of future documentation systems.
  • Gains are especially pronounced for high-level scripting languages, suggesting language-specific strengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could reduce onboarding time for new developers by supplying instant architecture overviews instead of requiring manual exploration.
  • Embedding the framework in version-control workflows might keep documentation synchronized as code changes.
  • The same hierarchical and multi-agent pattern could extend to generating explanations for data pipelines or hardware designs.
  • Validating the automated scores against human developer feedback on usefulness would test whether the quality metric predicts practical value.

Load-bearing premise

The LLM-based assessment protocols and multi-dimensional rubrics in CodeWikiBench can accurately measure holistic documentation quality without introducing bias or overlooking key architectural aspects.
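One way to picture what this premise rests on is a weighted aggregation of per-dimension judge scores into a single quality percentage. The sketch below is an assumption about the general shape of such a rubric, not CodeWikiBench's actual protocol: the dimension names are borrowed from the simulated rebuttal further down, while the weights, the 0-to-1 score scale, and the `quality_score` function are hypothetical.

```python
# Hypothetical rubric weights; CodeWikiBench's real rubric may differ.
RUBRIC_WEIGHTS = {
    "completeness": 0.3,
    "accuracy": 0.3,
    "architectural_coherence": 0.2,
    "cross_module_coverage": 0.2,
}


def quality_score(judge_scores: dict[str, list[float]]) -> float:
    """Average each dimension over judges, then return the weighted mean as a percentage."""
    total = 0.0
    for dimension, weight in RUBRIC_WEIGHTS.items():
        scores = judge_scores[dimension]  # one score in [0, 1] per judge model
        total += weight * sum(scores) / len(scores)
    return 100 * total / sum(RUBRIC_WEIGHTS.values())


# Made-up scores from two judge models, for illustration only.
example = {
    "completeness": [0.8, 0.6],
    "accuracy": [0.9, 0.7],
    "architectural_coherence": [0.5, 0.5],
    "cross_module_coverage": [0.6, 0.4],
}
print(f"{quality_score(example):.2f}")
```

Any bias in a judge model or in the weights flows straight into the final number, which is why the premise is load-bearing.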

What would settle it

A side-by-side human expert evaluation of documentation generated by CodeWiki versus baselines on the same codebases, checking whether the 4.73 percentage-point quality advantage holds under direct review or in real maintenance-task performance.
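A minimal sketch of how such a check could be scored: compute the Spearman rank correlation between LLM-judge scores and human ratings for the same documents. The paper reports no such study, so the data below are invented placeholders and the function names are illustrative; the implementation uses average ranks for ties and only the standard library.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return covariance / (var_x * var_y) ** 0.5


# Placeholder data: LLM-judge scores (%) and human ratings (1-5) for five documents.
llm_scores = [70, 65, 80, 55, 60]
human_ratings = [4, 3, 5, 1, 2]
print(spearman(llm_scores, human_ratings))
```

A high correlation would support using the LLM judge as a proxy; a low one would suggest the reported advantage may not survive human review.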

Original abstract

Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06%) by 4.73%, with particularly strong improvements on high-level scripting languages (+10.47%). We open-source CodeWiki to foster future research and community adoption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. It introduces three innovations: hierarchical decomposition to preserve architectural context, recursive multi-agent processing with dynamic task delegation, and multi-modal synthesis integrating text with visual artifacts like architecture diagrams. To support evaluation, the authors propose CodeWikiBench featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results claim CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the DeepWiki baseline (64.06%) by 4.73%, with stronger gains (+10.47%) on high-level scripting languages. The work is open-sourced.

Significance. If the central empirical claims hold under validated evaluation, the framework and benchmark could meaningfully advance automated documentation for large, evolving codebases by better capturing cross-module and system-level interactions. The open-sourcing and multi-language coverage are positive for community adoption and reproducibility. However, the significance is tempered by the unvalidated nature of the LLM-as-judge protocol, which is load-bearing for the reported improvements.

major comments (1)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: The headline 68.79% vs. 64.06% quality scores and the +10.47% gain on scripting languages are produced entirely by the LLM-based assessment protocols and multi-dimensional rubrics of CodeWikiBench. No human correlation study, inter-annotator agreement figures, or ablation against actual maintenance tasks are reported, leaving the central claim that CodeWiki produces superior holistic documentation dependent on an unverified proxy that may embed judge-model biases.
minor comments (1)
  1. [Abstract] The abstract states concrete percentage improvements without accompanying error bars or statistical significance tests; adding these would strengthen the presentation of the 4.73% gain.
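The kind of error bar the minor comment asks for could be obtained with a paired percentile bootstrap over per-repository scores, sketched below. The paper does not report per-repository data in this summary, so the score lists are invented placeholders, and `paired_bootstrap_ci` is an illustrative helper rather than anything from the CodeWiki codebase.

```python
import random


def paired_bootstrap_ci(system_a, system_b, resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-repository difference (a - b)."""
    diffs = [a - b for a, b in zip(system_a, system_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(resamples):
        # Resample repositories with replacement, keeping each pair together.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Placeholder per-repository quality scores (%); not values from the paper.
codewiki_scores = [70.0, 66.5, 72.0, 61.0, 68.0, 75.5]
deepwiki_scores = [64.0, 65.0, 66.0, 60.5, 63.0, 66.0]

lower, upper = paired_bootstrap_ci(codewiki_scores, deepwiki_scores)
print(f"95% CI for mean gain: [{lower:.2f}, {upper:.2f}] points")
```

If the interval excludes zero, the gain is unlikely to be an artifact of which repositories were sampled; it says nothing about whether the judge itself is biased.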

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The concern regarding validation of the LLM-as-judge protocol in CodeWikiBench is well-taken, and we address it directly below while outlining targeted revisions.

Point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The headline 68.79% vs. 64.06% quality scores and the +10.47% gain on scripting languages are produced entirely by the LLM-based assessment protocols and multi-dimensional rubrics of CodeWikiBench. No human correlation study, inter-annotator agreement figures, or ablation against actual maintenance tasks are reported, leaving the central claim that CodeWiki produces superior holistic documentation dependent on an unverified proxy that may embed judge-model biases.

    Authors: We appreciate the referee highlighting that our primary results rest on the LLM-based evaluation in CodeWikiBench. The multi-dimensional rubrics were explicitly constructed to assess documentation along axes of completeness, accuracy, architectural coherence, and cross-module coverage, with the intent of reducing reliance on any single subjective judgment. Nevertheless, we acknowledge that the current manuscript does not include a human correlation study, inter-annotator agreement statistics, or downstream ablation on maintenance tasks. This is a genuine limitation of the reported evaluation. In the revised manuscript we will (i) add a dedicated paragraph in the Experimental Results section discussing potential judge-model biases and the rubric design choices made to promote consistency, (ii) report score variance across two additional judge models to demonstrate robustness, and (iii) explicitly state the scope of CodeWikiBench as an intrinsic quality benchmark rather than a proxy for end-to-end maintenance outcomes. These changes will make the evidential basis of our claims more transparent without overstating the current validation.

    revision: partial
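The robustness check proposed in point (ii) amounts to summarizing how much one system's score moves when the judge model changes. A minimal sketch follows, assuming three judges and placeholder scores; the judge labels and numbers are invented for illustration and do not come from the paper.

```python
from statistics import mean, stdev

# Placeholder scores (%) for one system under three judge models; not from the paper.
scores_by_judge = {"judge_a": 68.0, "judge_b": 70.0, "judge_c": 66.0}

values = list(scores_by_judge.values())
spread = stdev(values)  # sample standard deviation across judges
print(f"mean {mean(values):.2f}, stdev {spread:.2f}, range {max(values) - min(values):.2f}")
```

If the spread across judges is comparable to the gap between systems, the ranking of systems depends on the choice of judge.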

Circularity Check

0 steps flagged

No significant circularity: empirical comparison to external baseline on newly introduced benchmark

full rationale

The paper introduces CodeWiki as a framework with hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, then defines CodeWikiBench with multi-dimensional rubrics and LLM-based assessment to evaluate it. The central result (68.79% for CodeWiki vs. 64.06% for DeepWiki) is obtained by applying the same benchmark protocol to both the proposed system and an independent closed-source baseline. No equations, fitted parameters, or self-citations are shown that would make the reported superiority equivalent to the inputs by construction. The evaluation protocol is external to the method itself and the comparison provides an independent check, satisfying the criteria for a self-contained empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework builds on existing LLM capabilities and standard software engineering practices without introducing new mathematical axioms, free parameters fitted to data, or invented physical entities; the main additions are engineering choices in the multi-agent workflow.




Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Documentation-Guided Agentic Codebase Migration from C to Rust

    cs.SE 2026-05 unverdicted novelty 7.0

    RustPrint uses documentation as a migration blueprint for agentic C-to-Rust translation, achieving full compilation and higher feature preservation than baselines on eight real-world repositories from 11K to 84K LoC.

  2. Documentation-Guided Agentic Codebase Migration from C to Rust

    cs.SE 2026-05 unverdicted novelty 7.0

    RustPrint is a documentation-guided agentic system that migrates entire C repositories to Rust by using architecture docs as blueprints, achieving full compilability and 93-95% feature/test preservation on eight 11K-8...

  3. Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

    cs.SE 2026-05 unverdicted novelty 7.0

    MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

  4. RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

    cs.SE 2026-04 unverdicted novelty 7.0

    RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...

  5. SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

    cs.SE 2025-12 unverdicted novelty 7.0

    SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
