CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
Pith reviewed 2026-05-18 03:07 UTC · model grok-4.3
The pith
CodeWiki generates holistic, architecture-aware documentation for large codebases by combining hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis. It scores 68.79% on CodeWikiBench, exceeding the closed-source DeepWiki baseline (64.06%) by 4.73 percentage points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeWiki is a unified framework for repository-level documentation generation. It employs hierarchical decomposition to preserve architectural context across granularity levels, recursive multi-agent processing with dynamic task delegation to achieve scalability, and multi-modal synthesis to integrate textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. Evaluated on the new CodeWikiBench benchmark with proprietary models, it reaches a 68.79% quality score, 4.73 percentage points above the closed-source DeepWiki baseline (64.06%), with notably stronger results on high-level scripting languages.
What carries the argument
The CodeWiki framework, which unites hierarchical decomposition to keep architectural context, recursive multi-agent processing for scalable task handling, and multi-modal synthesis of text plus visuals to model cross-file and system-level interactions.
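The recursive delegation idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `Module` class, the token `budget`, and the stub "documented directly" step (standing in for an LLM agent call) are all hypothetical names chosen for illustration. The sketch assumes that a subtree small enough to fit the budget is handled by one agent, and that a larger one is split among sub-agents whose outputs the parent then synthesizes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical node in a repository's module tree."""
    name: str
    size: int = 0  # assumed token count of the module's own source
    children: list[Module] = field(default_factory=list)


def total_size(module: Module) -> int:
    """Token count of the module plus all of its descendants."""
    return module.size + sum(total_size(c) for c in module.children)


def document(module: Module, budget: int, depth: int = 0) -> list[str]:
    """Return documentation lines for the tree rooted at `module`.

    A subtree that fits within `budget` (or has no children to split
    into) is documented by a single stub agent. Otherwise each child is
    delegated to a sub-agent and the parent writes an overview on top.
    """
    indent = "  " * depth
    size = total_size(module)
    if size <= budget or not module.children:
        return [f"{indent}{module.name}: documented directly ({size} tokens)"]
    lines = [
        f"{indent}{module.name}: overview synthesized from "
        f"{len(module.children)} sub-modules"
    ]
    for child in module.children:
        lines.extend(document(child, budget, depth + 1))
    return lines


if __name__ == "__main__":
    repo = Module("repo", 10, [
        Module("core", 50, [Module("core.parser", 30), Module("core.graph", 40)]),
        Module("cli", 20),
    ])
    print("\n".join(document(repo, budget=60)))
```

The parent overview is emitted only after the decision to delegate, which mirrors the claim that architectural context is preserved at each level: every level of the tree gets either a direct description or a synthesized one.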
If this is right
- Long-term software maintenance improves when documentation captures cross-module dependencies rather than isolated functions.
- Collaboration benefits from system-level views that reveal how components interact.
- Generation scales to evolving repositories through dynamic agent delegation without manual intervention.
- Standardized benchmarking via CodeWikiBench enables direct comparison of future documentation systems.
- Gains are especially pronounced for high-level scripting languages, suggesting language-specific strengths.
Where Pith is reading between the lines
- Teams could reduce onboarding time for new developers by supplying instant architecture overviews instead of requiring manual exploration.
- Embedding the framework in version-control workflows might keep documentation synchronized as code changes.
- The same hierarchical and multi-agent pattern could extend to generating explanations for data pipelines or hardware designs.
- Validating the automated scores against human developer feedback on usefulness would test whether the quality metric predicts practical value.
Load-bearing premise
The LLM-based assessment protocols and multi-dimensional rubrics in CodeWikiBench can accurately measure holistic documentation quality without introducing bias or overlooking key architectural aspects.
What would settle it
A side-by-side human expert evaluation of documentation generated by CodeWiki versus baselines on the same codebases, checking whether the 4.73-percentage-point quality advantage holds under direct review or in real maintenance-task performance.
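One way such a check could be quantified is a rank correlation between human expert ratings and the LLM judge's scores on the same documents. The sketch below computes Spearman's rank correlation with the standard library only. The function names are illustrative, no such analysis appears in the paper, and any ratings fed into it would have to come from a real human study.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(human_scores, judge_scores):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(human_scores), average_ranks(judge_scores))
```

A correlation near 1 would suggest the automated scores order documents the way experts do; a low or negative value would undercut the reported advantage.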
Original abstract
Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06%) by 4.73%, with particularly strong improvements on high-level scripting languages (+10.47%). We open-source CodeWiki to foster future research and community adoption.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. It introduces three innovations: hierarchical decomposition to preserve architectural context, recursive multi-agent processing with dynamic task delegation, and multi-modal synthesis integrating text with visual artifacts such as architecture diagrams. To support evaluation, the authors propose CodeWikiBench, which features multi-dimensional rubrics and LLM-based assessment protocols. The experimental results report that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the DeepWiki baseline (64.06%) by 4.73 percentage points, with stronger gains (+10.47%) on high-level scripting languages. The work is open-sourced.
Significance. If the central empirical claims hold under validated evaluation, the framework and benchmark could meaningfully advance automated documentation for large, evolving codebases by better capturing cross-module and system-level interactions. The open-sourcing and multi-language coverage are positive for community adoption and reproducibility. However, the significance is tempered by the unvalidated nature of the LLM-as-judge protocol, which is load-bearing for the reported improvements.
major comments (1)
- [Abstract / Experimental Results] Abstract and Experimental Results section: The headline 68.79% vs. 64.06% quality scores and the +10.47% gain on scripting languages are produced entirely by the LLM-based assessment protocols and multi-dimensional rubrics of CodeWikiBench. No human correlation study, inter-annotator agreement figures, or ablation against actual maintenance tasks are reported, leaving the central claim that CodeWiki produces superior holistic documentation dependent on an unverified proxy that may embed judge-model biases.
minor comments (1)
- [Abstract] The abstract states concrete percentage improvements without accompanying error bars or statistical significance tests; adding these would strengthen the presentation of the 4.73% gain.
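The error bars requested in the minor comment could be obtained with a paired bootstrap over repositories. The following is a minimal sketch under assumptions: it presumes per-repository quality scores are available for both systems on the same repositories, and the function name, resample count, and seed are illustrative rather than taken from the paper.

```python
import random


def paired_bootstrap_ci(system, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired score difference.

    `system` and `baseline` hold scores for the same repositories in the
    same order. Returns (mean_difference, ci_low, ci_high).
    """
    if len(system) != len(baseline) or not system:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [s - b for s, b in zip(system, baseline)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    means = []
    for _ in range(n_resamples):
        # Resample repositories with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high
```

If the resulting interval excludes zero, the gain is unlikely to be an artifact of which repositories were sampled. The method says nothing about judge-model bias, which is the separate concern raised in the major comment.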
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The concern regarding validation of the LLM-as-judge protocol in CodeWikiBench is well-taken, and we address it directly below while outlining targeted revisions.
Point-by-point responses
-
Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The headline 68.79% vs. 64.06% quality scores and the +10.47% gain on scripting languages are produced entirely by the LLM-based assessment protocols and multi-dimensional rubrics of CodeWikiBench. No human correlation study, inter-annotator agreement figures, or ablation against actual maintenance tasks are reported, leaving the central claim that CodeWiki produces superior holistic documentation dependent on an unverified proxy that may embed judge-model biases.
Authors: We appreciate the referee highlighting that our primary results rest on the LLM-based evaluation in CodeWikiBench. The multi-dimensional rubrics were explicitly constructed to assess documentation along axes of completeness, accuracy, architectural coherence, and cross-module coverage, with the intent of reducing reliance on any single subjective judgment. Nevertheless, we acknowledge that the current manuscript does not include a human correlation study, inter-annotator agreement statistics, or downstream ablation on maintenance tasks. This is a genuine limitation of the reported evaluation. In the revised manuscript we will (i) add a dedicated paragraph in the Experimental Results section discussing potential judge-model biases and the rubric design choices made to promote consistency, (ii) report score variance across two additional judge models to demonstrate robustness, and (iii) explicitly state the scope of CodeWikiBench as an intrinsic quality benchmark rather than a proxy for end-to-end maintenance outcomes. These changes will make the evidential basis of our claims more transparent without overstating the current validation.
Revision: partial
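Item (ii) of the response, score variance across judge models, could be reported along the following lines. This is a hedged sketch: the four axis names come from the rebuttal text, but the equal weighting, the 0-to-1 score scale, and the function names are assumptions, since the page does not describe how CodeWikiBench aggregates its rubric.

```python
from statistics import mean, stdev

# Rubric axes named in the rebuttal; equal weights are an assumption.
AXES = ("completeness", "accuracy", "architectural_coherence", "cross_module_coverage")


def rubric_score(axis_scores, weights=None):
    """Weighted mean of per-axis scores for a single judge."""
    weights = weights or {axis: 1.0 for axis in AXES}
    total_weight = sum(weights[axis] for axis in AXES)
    return sum(axis_scores[axis] * weights[axis] for axis in AXES) / total_weight


def judge_spread(scores_by_judge):
    """Return (mean, sample standard deviation, per-judge scores).

    `scores_by_judge` maps a judge-model name to its per-axis scores.
    The standard deviation is 0.0 when only one judge is given.
    """
    per_judge = {judge: rubric_score(s) for judge, s in scores_by_judge.items()}
    values = list(per_judge.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread, per_judge
```

A small spread across judges would indicate the ranking is not specific to one judge model. It would not by itself show agreement with human reviewers.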
Circularity Check
No significant circularity: empirical comparison to external baseline on newly introduced benchmark
Full rationale
The paper introduces CodeWiki as a framework with hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, then defines CodeWikiBench with multi-dimensional rubrics and LLM-based assessment to evaluate it. The central result (68.79% vs. 64.06% on DeepWiki) is obtained by applying the same benchmark protocol to both the proposed system and an independent closed-source baseline. No equations, fitted parameters, or self-citations are shown that would make the reported superiority equivalent to the inputs by construction. The evaluation protocol is external to the method itself and the comparison provides an independent check, satisfying the criteria for a self-contained empirical claim.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
hierarchical decomposition that preserves architectural context across multiple levels of granularity, recursive multi-agent processing with dynamic task delegation
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
multi-dimensional rubrics and LLM-based assessment protocols
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Documentation-Guided Agentic Codebase Migration from C to Rust
RustPrint uses documentation as a migration blueprint for agentic C-to-Rust translation, achieving full compilation and higher feature preservation than baselines on eight real-world repositories from 11K to 84K LoC.
-
Documentation-Guided Agentic Codebase Migration from C to Rust
RustPrint is a documentation-guided agentic system that migrates entire C repositories to Rust by using architecture docs as blueprints, achieving full compilability and 93-95% feature/test preservation on eight 11K-8...
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.