The Illusion of Agentic Complexity in README.md Generation: Evaluating Single-Agent vs. Multi-Agent RAG Systems
Pith reviewed 2026-06-30 04:54 UTC · model grok-4.3
The pith
Single-agent RAG matches multi-agent lexical quality for READMEs while cutting token use by 86% and doubling speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single-agent RAG pipeline reaches lexical quality comparable to a specialized multi-agent system for generating README files, while cutting token consumption by 86 percent and running at twice the speed. Manual taxonomy review shows the multi-agent system reaches 98 percent structural consistency and fixes formatting problems seen in single-agent output. Autonomous planning is the main bottleneck in the pipelines; adding lightweight developer-guided plans produces the highest overall documentation quality and exceeds every other tested setup, including the LARCH baseline.
What carries the argument
Head-to-head comparison of single-agent RAG, multi-agent RAG, and developer-guided planning variants, scored on lexical similarity, manual taxonomy for structure, token count, and runtime against LARCH and ground-truth READMEs.
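The summary does not name the specific lexical-similarity metric the paper uses. As a rough illustration of what a lexical score measures, here is a minimal sketch of a ROUGE-1-style unigram F1 between a generated README and its ground-truth counterpart. The function names `tokenize` and `unigram_f1` and the simple regex tokenizer are assumptions made for this example, not the paper's implementation.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall.

    Hypothetical helper for illustration; the paper's metric may differ.
    """
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Multiset intersection keeps the clipped count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A score of this kind rewards shared vocabulary only, so two READMEs can score similarly while differing in structure or factual accuracy, which is the gap the referee report points to.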
If this is right
- Single-agent pipelines can substitute for multi-agent systems when lexical match is the main goal and resource limits matter.
- Multi-agent coordination raises structural consistency to 98 percent and removes common formatting errors.
- Light developer input on planning improves quality beyond fully autonomous single-agent or multi-agent runs.
- Autonomous planning remains the dominant performance constraint across all tested architectures.
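To make the efficiency figures concrete, the arithmetic behind a percentage reduction and a speedup can be sketched as follows. The token and runtime numbers in the usage note are hypothetical placeholders chosen to reproduce the reported ratios; they are not measurements from the paper.

```python
def percent_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - candidate) / baseline


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


def cost_multiplier(reduction_percent: float) -> float:
    """Baseline cost expressed as a multiple of the candidate cost."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)
```

With illustrative values of 100,000 tokens for the multi-agent run and 14,000 for the single-agent run, `percent_reduction(100_000, 14_000)` gives 86.0, and `cost_multiplier(86)` shows that an 86% reduction means the multi-agent system consumes roughly 7.1 times as many tokens.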
Where Pith is reading between the lines
- The same efficiency-versus-structure trade-off could appear in related tasks such as code summarization or test-case generation.
- Teams that need only basic lexical coverage may prefer single-agent setups for daily use rather than full multi-agent orchestration.
- Testing the same pipelines on repositories from different domains or with larger codebases would show whether the 86 percent token saving and speed gain hold at scale.
Load-bearing premise
The chosen metrics of lexical similarity, structural consistency, token use, and speed, together with the LARCH baseline and selected repositories, give an unbiased picture of practical README quality for developers.
What would settle it
A replication study on a fresh collection of repositories in which the single-agent lexical scores fall clearly below the multi-agent scores or the developer-guided variant no longer ranks first would undermine the reported trade-off.
Original abstract
Large Language Models (LLMs) are increasingly utilized to automate several software engineering tasks, including code completion, code summarization, testing, and the generation of repository-level documentation. While Multi-Agent Systems (MAS) are often adopted to support such tasks under the premise that task decomposition improves performance, the impact of architectural complexity on practical efficiency remains under-examined. This study empirically evaluates Retrieval-Augmented Generation (RAG) dependent architectures for the generation of README files for GitHub repositories. In this work, we conducted a systematic comparison between a Single-Agent pipeline, a specialized MAS, and a developer-guided planning (DevPlan) variant, benchmarked against LARCH -- a state-of-the-art baseline -- and the original ground truth. Results indicate a critical architectural trade-off: the Single-Agent pipeline achieves lexical quality comparable to MAS while reducing token consumption by 86% and operating at twice the speed. In contrast, manual taxonomy analysis demonstrates that MAS achieves high structural consistency (98%), resolving formatting issues observed in single-agent approaches. Autonomous planning is identified as the primary pipeline bottleneck; incorporating lightweight developer-guided plans produces the highest overall documentation quality, surpassing all the analyzed configurations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study comparing Single-Agent, Multi-Agent System (MAS), and developer-guided planning (DevPlan) RAG pipelines for generating README.md files from GitHub repositories. It benchmarks these against the LARCH baseline and original ground truth, concluding that Single-Agent matches MAS in lexical quality while consuming 86% fewer tokens and running at twice the speed, MAS achieves 98% structural consistency, and DevPlan produces the highest overall quality by addressing the bottleneck of autonomous planning.
Significance. If the findings are robust, the work provides evidence against the default adoption of complex multi-agent architectures for documentation tasks in software engineering, demonstrating substantial efficiency gains from simpler designs and benefits from lightweight human guidance. The direct comparison to an external baseline and ground truth is a strength, as is the identification of planning as a key bottleneck.
major comments (2)
- [Abstract] The abstract states quantitative results (e.g., 86% token reduction, 98% consistency, 2x speed) but supplies no details on repository sample size, selection criteria, statistical tests, error bars, or controls for confounds such as repository size or domain. This absence makes it impossible to verify whether the data support the stated claims about architectural trade-offs.
- [Evaluation] The central claims of an architectural trade-off and DevPlan superiority rest on lexical similarity, manual taxonomy consistency, token count, and speed as proxies for README quality. These metrics do not directly assess factual accuracy of generated content, completeness for onboarding, or developer-perceived usefulness. Without additional validation or metrics addressing these aspects, the reported conclusions do not necessarily follow.
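One common way to supply the error bars the first comment asks for is a paired percentile bootstrap over per-repository scores. The sketch below assumes each pipeline yields one score per repository, in the same repository order; the function name `paired_bootstrap_ci` and its defaults are illustrative, and nothing here indicates which statistical procedure the authors actually used.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of scores_a[i] - scores_b[i].

    Hypothetical helper: resamples repositories with replacement and
    returns the (lower, upper) bounds of the (1 - alpha) interval.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval for single-agent minus multi-agent lexical scores straddles zero, that would be consistent with the "comparable lexical quality" claim; an interval clearly below zero would not.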
minor comments (2)
- [Results] Ensure all tables and figures report exact sample sizes, confidence intervals where applicable, and full configuration details for each pipeline variant.
- [Methodology] Clarify the exact procedure and inter-rater reliability for the manual taxonomy analysis used to compute the 98% structural consistency figure.
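Inter-rater reliability for a manual labeling step such as the taxonomy analysis is often reported as Cohen's kappa. The following is a minimal sketch for two raters who assign one categorical label to each README; the label values in the usage note are hypothetical, and the summary does not say which agreement statistic, if any, the authors computed.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label
        # throughout; return 1.0 by convention since agreement is perfect.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, two raters who both label four READMEs as `["ok", "ok", "bad", "bad"]` obtain a kappa of 1.0, while raters whose labels match only at chance level obtain 0.0.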
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments. We respond to each major comment below and have revised the manuscript where we agree changes are needed to improve clarity and address limitations.
- Referee: [Abstract] The abstract states quantitative results (e.g., 86% token reduction, 98% consistency, 2x speed) but supplies no details on repository sample size, selection criteria, statistical tests, error bars, or controls for confounds such as repository size or domain. This absence makes it impossible to verify whether the data support the stated claims about architectural trade-offs.
  Authors: The abstract provides a concise summary of the results. Full details on the experimental setup, including the number of repositories evaluated, selection criteria, and controls for repository characteristics, are described in the Methodology and Evaluation sections. To address the referee's concern, we have revised the abstract to include the sample size and a note on the controls used, while maintaining brevity.
  Revision: yes
- Referee: [Evaluation] The central claims of an architectural trade-off and DevPlan superiority rest on lexical similarity, manual taxonomy consistency, token count, and speed as proxies for README quality. These metrics do not directly assess factual accuracy of generated content, completeness for onboarding, or developer-perceived usefulness. Without additional validation or metrics addressing these aspects, the reported conclusions do not necessarily follow.
  Authors: We agree that the metrics employed are proxies and do not directly measure factual accuracy or perceived usefulness. Our study focuses on efficiency and structural aspects using standard lexical and consistency metrics, benchmarked against ground truth and LARCH. The conclusions regarding the trade-offs are supported within the scope of these metrics. We have added a paragraph in the Discussion section acknowledging this limitation and outlining plans for future human-centered evaluation.
  Revision: partial
Circularity Check
Empirical benchmarking with direct measurements; no derivations or self-referential reductions
The paper is a comparative empirical study that directly measures lexical similarity, structural consistency (via manual taxonomy), token consumption, and runtime speed for Single-Agent, MAS, and DevPlan pipelines against ground-truth READMEs and the external LARCH baseline. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. All reported trade-offs follow from observed experimental outcomes rather than any definitional or self-referential construction. The metric-validity concern raised in the referee report is a question of external validity, not circularity.