Recognition: no theorem link
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
Pith reviewed 2026-05-15 19:49 UTC · model grok-4.3
The pith
A multi-vocal review of 114 academic and industry studies classifies the motivations for multi-agent LLM code generation into nine categories, maps commonly used models and benchmarks, and organizes challenges and future research directions into six main categories each.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through a multi-vocal literature review of 114 studies, the authors establish that motivations for adopting multi-agent LLM systems for code generation fall into nine categories, that the studies employ a recognizable set of models and evaluation benchmarks, that challenges and solutions group into six main categories with 26 subcategories, and that future research directions organize into six main categories with 18 subcategories. The synthesis draws from both academic and industrial sources to support further studies and real-world adoption.
What carries the argument
The multi-vocal literature review (MLR) method, which integrates peer-reviewed papers and grey literature to classify motivations, models, benchmarks, challenges, solutions, and future directions across the selected studies.
If this is right
- The nine motivation categories give practitioners a checklist for deciding whether a multi-agent architecture is appropriate for a given code-generation task.
- The mapped models and benchmarks supply a reference point for choosing LLM configurations and evaluation methods in new work; a sketch of the standard pass@k metric follows this list.
- The six challenge categories with 26 subcategories identify the concrete obstacles that must be solved before reliable industrial use.
- The six categories of future directions with 18 subcategories can be used to prioritize research agendas.
- The overall synthesis supports the transition of multi-agent code generation from research prototypes to production settings.
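On the evaluation side of that reference point: most of the benchmarks catalogued in the review (HumanEval, MBPP, and their descendants; see [22] and [23] in the reference graph below) report the pass@k metric of Chen et al. [22]. Below is a minimal sketch of its unbiased estimator; the sample counts in the usage lines are chosen purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. [22]:
    1 - C(n - c, k) / C(n, k), for n samples of which c pass."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples per task, 37 passing.
print(round(pass_at_k(200, 37, 1), 3))   # 0.185
print(round(pass_at_k(200, 37, 10), 3))  # 0.877
```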
Where Pith is reading between the lines
- If the synthesized challenges are addressed in priority order, the gap between research prototypes and deployable industrial tools may narrow faster than isolated studies suggest.
- Standardization of benchmarks across future papers could make the field more cumulative, building directly on the overview provided here.
- Closer tracking of grey literature in follow-up reviews may reveal whether industry practices are diverging from the academic patterns captured in this study.
- Testing whether the nine motivation categories remain stable as new papers appear would provide a direct measure of the review's lasting utility.
Load-bearing premise
The search and selection process captured a representative sample of both peer-reviewed and grey literature without significant bias, and the manual categorization into nine motivation categories, six challenge categories, and six future-direction categories is complete and reproducible.
What would settle it
A replication that retrieves a substantial set of additional studies missed by the original search and shows that these studies fall outside the nine motivation categories or the six challenge categories.
Original abstract
Large Language Models (LLMs) have enabled multi-agent systems to perform autonomous code generation for complex tasks. Despite the recent growth in research and industrial applications in this area, there is little work on synthesizing evidence from both academic and industrial sources to capture the current state of research on LLM-based multi-agent systems for code generation. To this end, we conducted a Multi-Vocal Literature Review (MLR), combining insights from both academia and industry, including peer-reviewed studies and grey literature. The aim of this study is to systematically synthesize and analyze existing knowledge on LLM-based multi-agent systems for code generation. Specifically, the review examines the motivations for their use, employed benchmarks and models, key challenges, proposed solutions, and potential directions for future research. We selected and reviewed 114 studies, and the key findings are: 1) the identified reasons for adopting multi-agent systems for code generation were classified into nine categories; 2) the models and evaluation benchmarks utilized across the studies were systematically analyzed to provide a structured overview of commonly adopted LLM configurations and assessment practices; 3) the reported challenges and corresponding solutions were synthesized into six main categories and 26 subcategories; and 4) future research directions were identified and organized into six main categories and 18 subcategories. The results of this MLR will assist researchers and practitioners in pursuing further studies and supporting the real-world adoption of multi-agent systems in industrial settings.
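To make the abstract's subject matter concrete, here is a minimal sketch of the iterative plan-code-test loop that recurs across the reviewed systems (compare AgentCoder [6] and MapCoder [7] in the reference graph below). The agent roles, prompt wording, `call_llm` stub, and retry budget are illustrative assumptions, not any surveyed system's actual design.

```python
# Sketch of a planner/coder/tester agent loop; roles and prompts are
# hypothetical, and call_llm must be wired to a real model client.

def call_llm(prompt: str) -> str:
    """Stub for a chat-completion call; replace with a real client."""
    raise NotImplementedError

def planner(task: str) -> str:
    return call_llm(f"Break this coding task into implementation steps:\n{task}")

def coder(task: str, plan: str, feedback: str) -> str:
    return call_llm(
        f"Task: {task}\nPlan: {plan}\n"
        f"Feedback from the last test run (empty on round one): {feedback}\n"
        "Write a complete Python solution."
    )

def tester(code: str, tests: list[str]) -> tuple[bool, str]:
    """Run the candidate against assert-style tests; return (passed, feedback)."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # NOTE: sandbox untrusted code in practice
        for test in tests:
            exec(test, namespace)  # each test is an assert statement
        return True, ""
    except Exception as exc:
        return False, repr(exc)

def generate(task: str, tests: list[str], max_rounds: int = 3) -> str | None:
    plan = planner(task)
    feedback = ""
    for _ in range(max_rounds):
        code = coder(task, plan, feedback)
        passed, feedback = tester(code, tests)
        if passed:
            return code
    return None  # budget exhausted without a passing candidate
```

The feedback edge from the tester back to the coder is what separates this pattern from single-shot prompting.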
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a multi-vocal literature review (MLR) of 114 studies (peer-reviewed and grey literature) on LLM-based multi-agent systems for code generation. It classifies motivations for adoption into nine categories, provides a structured analysis of employed models and evaluation benchmarks, synthesizes reported challenges and solutions into six main categories with 26 subcategories, and organizes future research directions into six main categories with 18 subcategories.
Significance. If the underlying selection and categorization processes prove robust and reproducible, the review would offer a useful consolidated overview of motivations, common LLM configurations, challenges, and open questions in an emerging sub-area, helping researchers and practitioners navigate the literature and identify gaps for both academic and industrial work.
major comments (2)
- [Methods] Methods section: The account of the search strategy, databases queried, search strings, inclusion/exclusion criteria, and quality assessment is missing or only high-level. Without these details the claim of having systematically selected 114 representative studies cannot be verified and the risk of selection bias remains unaddressed.
- [Results] Results (categorization subsections): No coding protocol, codebook, double-coding procedure, or inter-rater agreement statistic (e.g., Cohen’s kappa or percentage agreement) is reported for mapping study content onto the nine motivation categories, six challenge categories, or six future-direction categories. This makes the taxonomy counts and distributions difficult to reproduce or audit.
minor comments (1)
- [Abstract] Abstract: The time window of the literature search and the final search date are not stated, which would help readers assess currency of the 114-study corpus.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects of transparency and reproducibility in our multi-vocal literature review. We have prepared point-by-point responses below and will revise the manuscript to address the concerns where feasible.
Point-by-point responses
Referee: [Methods] Methods section: The account of the search strategy, databases queried, search strings, inclusion/exclusion criteria, and quality assessment is missing or only high-level. Without these details the claim of having systematically selected 114 representative studies cannot be verified and the risk of selection bias remains unaddressed.
Authors: We agree that the Methods section in the submitted manuscript is high-level and insufficient for full verification. In the revised version, we will expand this section to provide a complete account of the search strategy, including the specific academic databases queried (e.g., IEEE Xplore, ACM DL, Scopus), grey literature sources (e.g., arXiv, GitHub repositories, industry white papers), exact search strings, inclusion/exclusion criteria, and the quality assessment process used to select the 114 studies. This will directly address concerns about selection bias and enable reproducibility. revision: yes
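As an illustration of the kind of auditable selection step such a revision could document, here is a minimal screening sketch over retrieved records. The keyword filter, language restriction, cutoff year, and record fields are hypothetical stand-ins, not the authors' actual inclusion/exclusion criteria.

```python
# Hypothetical MLR screening pass; every criterion below is an assumption
# for illustration, not the protocol used to select the 114 studies.
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    abstract: str
    year: int
    venue: str       # empty string for grey literature
    language: str

REQUIRED_TERMS = ("multi-agent", "code generation")  # assumed search terms
MIN_YEAR = 2020                                      # assumed cutoff

def include(rec: Record) -> bool:
    """Apply the (hypothetical) inclusion criteria to one record."""
    text = f"{rec.title} {rec.abstract}".lower()
    return (
        rec.language == "en"
        and rec.year >= MIN_YEAR
        and all(term in text for term in REQUIRED_TERMS)
    )

def screen(records: list[Record]) -> tuple[list[Record], list[Record]]:
    """Split retrieved records into included and excluded sets,
    so the counts at each stage can be reported and audited."""
    included = [r for r in records if include(r)]
    excluded = [r for r in records if not include(r)]
    return included, excluded
```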
Referee: [Results] Results (categorization subsections): No coding protocol, codebook, double-coding procedure, or inter-rater agreement statistic (e.g., Cohen’s kappa or percentage agreement) is reported for mapping study content onto the nine motivation categories, six challenge categories, or six future-direction categories. This makes the taxonomy counts and distributions difficult to reproduce or audit.
Authors: We acknowledge that the categorization process lacks sufficient methodological detail in the current draft. The nine motivation categories, six challenge categories, and six future-direction categories were derived via iterative thematic analysis involving team discussions to resolve discrepancies. However, we did not implement a formal double-coding protocol with independent raters or compute inter-rater agreement statistics. In the revision, we will add a dedicated subsection describing the coding protocol, codebook development, and assignment process, including illustrative examples of study mappings. We will also note the absence of quantitative agreement metrics as a limitation and discuss how this affects auditability. revision: partial
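For concreteness, here is a minimal sketch of the two agreement statistics the referee requests, computed over hypothetical double-coded motivation labels; the category names and assignments are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa (p_o - p_e) / (1 - p_e) for two raters who each
    assign one category per study."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)  # chance agreement
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical double-coding of six studies into motivation categories.
a = ["modularity", "reliability", "modularity", "scalability", "reliability", "modularity"]
b = ["modularity", "reliability", "scalability", "scalability", "reliability", "modularity"]
percent_agreement = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"percentage agreement: {percent_agreement:.2f}")  # 0.83
print(f"Cohen's kappa:        {cohens_kappa(a, b):.2f}")  # 0.75
```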
Circularity Check
No circularity: standard literature review synthesis
Full rationale
This is a multi-vocal literature review that selects 114 external studies and synthesizes their reported motivations, models, challenges, solutions, and future directions into taxonomies. No derivations, equations, parameter fittings, or predictions are performed inside the paper; all content aggregates findings from the cited studies. The nine motivation categories, six challenge categories with 26 subcategories, and six future-direction categories are outputs of the review process applied to external sources rather than self-referential definitions or fitted inputs. No load-bearing step reduces by construction to the paper's own inputs, and the methods section describes a conventional search-and-selection protocol without internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Multi-vocal literature review methodology is appropriate and sufficient for synthesizing peer-reviewed and grey literature on LLM-based multi-agent code generation.
Reference graph
Works this paper leans on
- [1] L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374
- [2]
- [3]
- [4] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., MetaGPT: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352 (2023)
- [5] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, et al., ChatDev: Communicative agents for software development, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 15174–15186
- [6] D. Huang, J. M. Zhang, M. Luck, Q. Bu, Y. Qing, H. Cui, AgentCoder: Multi-agent-based code generation with iterative testing and optimisation, arXiv preprint arXiv:2312.13010 (2023)
- [7] M. A. Islam, M. E. Ali, M. R. Parvez, MapCoder: Multi-agent code generation for competitive problem solving, in: L. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Association for Computational Linguistics, 2024, p...
- [8] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, O. Press, SWE-agent: Agent-computer interfaces enable automated software engineering, Advances in Neural Information Processing Systems 37 (2024) 50528–50652
- [9] M. A. Islam, M. E. Ali, M. R. Parvez, CodeSim: Multi-agent code generation and problem solving through simulation-driven planning and debugging, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29-May 4, 2025, Association for Computational Linguistics, 2025, pp. 5113–...
- [10] J. He, C. Treude, D. Lo, LLM-based multi-agent systems for software engineering: Literature review, vision, and the road ahead, ACM Transactions on Software Engineering and Methodology 34 (5) (2025) 1–30
- [11] M. Mohammadi, Y. Li, J. Lo, W. Yip, Evaluation and benchmarking of LLM agents: A survey, in: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025, pp. 6129–6139
- [12] Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma, Q. Wang, Z. Zheng, Agents in software engineering: Survey, landscape, and vision, Automated Software Engineering 32 (2) (2025) 70
- [13] Z. Rasheed, LLM-based multi-agent systems for code generation: A multi-vocal literature review (Feb. 2026). doi:10.5281/zenodo.18763362. URL https://doi.org/10.5281/zenodo.18763362
- [14] V. Garousi, M. Felderer, M. V. Mäntylä, Guidelines for including grey literature and conducting multivocal literature reviews in software engineering, Information and Software Technology 106 (2019) 101–121
- [15] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic literature reviews in software engineering – a systematic literature review, Information and Software Technology 51 (1) (2009) 7–15
- [16] C. Schardt, M. B. Adams, T. Owens, S. Keitz, P. Fontelo, Utilization of the PICO framework to improve searching PubMed for clinical questions, BMC Medical Informatics and Decision Making 7 (1) (2007) 16
- [17] P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, M. Khalil, Lessons from applying the systematic literature review process within the software engineering domain, Journal of Systems and Software 80 (4) (2007) 571–583
- [18] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, 2014, pp. 1–10
- [19] T. Dybå, T. Dingsøyr, Empirical studies of agile software development: A systematic review, Information and Software Technology 50 (9-10) (2008) 833–859
- [20]
- [21]
- [22] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021)
- [23] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al., Program synthesis with large language models, arXiv preprint arXiv:2108.07732 (2021)
- [24]
- [25]
- [26] B. Szalontai, B. Márton, B. Pintér, T. Gregorics, Investigating reproducibility challenges in LLM bugfixing on the HumanEvalFix benchmark, Software 4 (3) (2025) 17
- [27] J. Liu, C. S. Xia, Y. Wang, L. Zhang, Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation, arXiv preprint arXiv:2305.01210 (2023)
- [28]
- [29]
- [30]
- [31] T. Helmuth, P. Kelly, PSB2: the second program synthesis benchmark suite, in: Proceedings of the Genetic and Evolutionary Computation Conference, 2021, pp. 785–794
- [32] Z. Wang, S. Zhou, D. Fried, G. Neubig, Execution-based evaluation for open-domain code generation, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 1271–1290
- [33]
- [34] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al., MultiPL-E: A scalable and extensible approach to benchmarking neural code generation, arXiv preprint arXiv:2208.08227 (2022)
- [35] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, K. Narasimhan, SWE-bench: Can language models resolve real-world GitHub issues?, arXiv preprint arXiv:2310.06770 (2023)
- [36] G. Kio, SWE-Bench-Secret: Automating AI agent evaluation for software engineering tasks (2025)
- [37] Y. Ding, Z. Wang, W. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, et al., CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion, Advances in Neural Information Processing Systems 36 (2023) 46701–46723
- [38] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, et al., Measuring coding challenge competence with APPS, arXiv preprint arXiv:2105.09938 (2021)
- [39]
- [40] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, I. Stoica, LiveCodeBench: Holistic and contamination free evaluation of large language models for code, arXiv preprint arXiv:2403.07974 (2024)
- [41] P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig, Learning to mine aligned code and natural language pairs from Stack Overflow, in: Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 476–486
- [42] D. Rodriguez-Cardenas, D. N. Palacio, D. Khati, H. Burke, D. Poshyvanyk, Benchmarking causal study to interpret large language models for source code, in: 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2023, pp. 329–334
- [43]
- [44] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al., AgentBench: Evaluating LLMs as agents, arXiv preprint arXiv:2308.03688 (2023)
- [45]
- [46] Z. Z. Wang, A. Asai, F. F. Xu, Y. Xie, G. Neubig, D. Fried, et al., CodeRAG-Bench: Can retrieval augment code generation?, in: Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 3199–3214
- [47] E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, F. Tramèr, AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents, Advances in Neural Information Processing Systems 37 (2024) 82895–82920
- [48] C. Guo, X. Liu, C. Xie, A. Zhou, Y. Zeng, Z. Lin, D. Song, B. Li, RedCode: Risky code execution and generation benchmark for code agents, Advances in Neural Information Processing Systems 37 (2024) 106190–106236
- [49] C. Tony, M. Mutas, N. E. D. Ferreyra, R. Scandariato, LLMSecEval: A dataset of natural language prompts for security evaluations, in: 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), IEEE, 2023, pp. 588–592
- [50] M. Liu, N. Pinckney, B. Khailany, H. Ren, VerilogEval: Evaluating large language models for Verilog code generation, in: 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), IEEE, 2023, pp. 1–8
- [51] N. Pinckney, C. Batten, M. Liu, H. Ren, B. Khailany, Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation, ACM Transactions on Design Automation of Electronic Systems (2025)
- [52] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, Q. Wang, T. Xie, CoderEval: A benchmark of pragmatic code generation with generative pre-trained models, in: Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 1–12
- [53]
- [54]
- [55] Y. Talebirad, A. Nadiri, Multi-agent collaboration: Harnessing the power of intelligent LLM agents, arXiv preprint arXiv:2306.03314 (2023)
- [56]
- [57] D. G. Paul, H. Zhu, I. Bayley, Benchmarks and metrics for evaluations of code generation: A critical review, in: 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), IEEE, 2024, pp. 87–94
- [58]
- [59] A. Kulkarni, M. Chakraborty, Blue sky: Reducing performance gap between commercial and open-source LLMs, in: Proceedings of the 2025 SIAM International Conference on Data Mining (SDM), SIAM, 2025, pp. 335–338
- [60]
- [61] A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, M. Shmueli-Scheuer, Survey on evaluation of LLM-based agents, arXiv preprint arXiv:2503.16416 (2025)
- [62] G. Liang, Q. Tong, LLM-powered AI agent systems and their applications in industry, arXiv preprint arXiv:2505.16120 (2025)
- [63]
- [64]
- [65] H. Elhashemy, Y. Lotfy, Y. Tang, Bridging the prototype-production gap: A multi-agent system for notebooks transformation, in: 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), IEEE, 2025, pp. 299–302
- [66] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical Software Engineering 14 (2009) 131–164
- [67] E. Blair, A reflexive exploration of two qualitative data coding techniques, Journal of Methods and Measurement in the Social Sciences 6 (1) (2015) 14–29
- [68] S. Keele, et al., Guidelines for performing systematic literature reviews in software engineering (2007)
- [69] F. Quin, D. Weyns, M. Galster, C. C. Silva, A/B testing: a systematic literature review, Journal of Systems and Software (2024) 112011
- [70]
- [71] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (3) (2023) 4–8
- [72] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, in: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), IEEE, 2023, pp. 31–53
- [73] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, ACM Transactions on Software Engineering and Methodology (2023)
- [74]
- [75] R. A. Husein, H. Aburajouh, C. Catal, Large language models for code completion: A systematic literature review, Computer Standards & Interfaces 92 (2025) 103917
- [76] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, Q. Wang, Software testing with large language models: Survey, landscape, and vision, IEEE Transactions on Software Engineering 50 (4) (2024) 911–936
- [77]
- [78] J. Shi, Z. Yang, D. Lo, Efficient and green large language models for software engineering: Literature review, vision, and the road ahead, ACM Transactions on Software Engineering and Methodology 34 (5) (2025) 1–22
- [79] M. K. Görmez, M. Yılmaz, P. M. Clarke, Large language models for software engineering: A systematic mapping study, in: European Conference on Software Process Improvement, Springer, 2024, pp. 64–79
- [80] B. V. L. d. Albuquerque, A. F. S. d. Cunha, L. Souza, S. W. M. Siqueira, R. P. d. Santos, Generating and reviewing programming codes with large language models: A systematic mapping study, in: Proceedings of the 20th Brazilian Symposium on Information Systems, 2024, pp. 1–10