pith. machine review for the scientific record.

arxiv: 2604.16321 · v1 · submitted 2026-02-25 · 💻 cs.SE

Recognition: no theorem link

LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review


Pith reviewed 2026-05-15 19:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords multi-agent systems · large language models · code generation · literature review · challenges · motivations · benchmarks · future directions

The pith

A review of 114 studies from academia and industry classifies nine categories of motivations for multi-agent LLM code generation, maps the models and benchmarks in common use, and organizes challenges and future directions into six categories each.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a multi-vocal literature review that combines peer-reviewed work and grey literature to synthesize the current state of LLM-based multi-agent systems for code generation. It examines 114 studies to group the reasons for adopting these systems into nine categories, map the models and benchmarks in use, and organize reported challenges with their solutions into six main categories containing 26 subcategories. The review also collects future research directions into six main categories with 18 subcategories. A sympathetic reader would care because the structured overview can help decide when and how to apply multi-agent setups in practice and point to concrete next steps for both research and industrial deployment.

Core claim

Through a multi-vocal literature review of 114 studies, the authors establish that motivations for adopting multi-agent LLM systems for code generation fall into nine categories, that the studies employ a recognizable set of models and evaluation benchmarks, that challenges and solutions group into six main categories with 26 subcategories, and that future research directions organize into six main categories with 18 subcategories. The synthesis draws from both academic and industrial sources to support further studies and real-world adoption.

What carries the argument

The multi-vocal literature review (MLR) method, which integrates peer-reviewed papers and grey literature to classify motivations, models, benchmarks, challenges, solutions, and future directions across the selected studies.

If this is right

  • The nine motivation categories give practitioners a checklist for deciding whether a multi-agent architecture is appropriate for a given code-generation task.
  • The mapped models and benchmarks supply a reference point for choosing LLM configurations and evaluation methods in new work.
  • The six challenge categories with 26 subcategories identify the concrete obstacles that must be solved before reliable industrial use.
  • The six categories of future directions with 18 subcategories can be used to prioritize research agendas.
  • The overall synthesis supports the transition of multi-agent code generation from research prototypes to production settings.
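Many of the code-generation benchmarks such a review maps (HumanEval and its descendants, referenced in the original paper) score models with pass@k. As a minimal sketch for readers connecting the benchmark discussion to concrete numbers, the standard unbiased estimator from Chen et al. (2021) can be written as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes the tests."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 generations per task, 3 of which pass the tests.
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3
```

The example numbers are illustrative only; the review itself reports which benchmarks appear in the surveyed studies, not benchmark scores.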

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the synthesized challenges are addressed in priority order, the gap between research prototypes and deployable industrial tools may narrow faster than isolated studies suggest.
  • Standardization of benchmarks across future papers could make the field more cumulative, building directly on the overview provided here.
  • Closer tracking of grey literature in follow-up reviews may reveal whether industry practices are diverging from the academic patterns captured in this study.
  • Testing whether the nine motivation categories remain stable as new papers appear would provide a direct measure of the review's lasting utility.

Load-bearing premise

The search and selection process captured a representative sample of both peer-reviewed and grey literature without significant bias, and the manual categorization into nine motivation categories, six challenge categories, and six future-direction categories is complete and reproducible.

What would settle it

A replication that retrieves a substantial set of additional studies missed by the original search and shows that these studies fall outside the nine motivation categories or the six challenge categories.

Figures

Figures reproduced from arXiv: 2604.16321 by Kai-Kristian Kemell, Mika Saari, Muhammad Waseem, Pekka Abrahamsson, Zeeshan Rasheed.

Figure 1: Visual representation of the research methodology implemented in this study for MLR
Figure 2: Demographic distribution of peer-reviewed and grey literature studies
Original abstract

Large Language Models (LLMs) have enabled multi-agent systems to perform autonomous code generation for complex tasks. Despite the recent growth in research and industrial applications in this area, there is little work on synthesizing evidence from both academic and industrial sources to capture the current state of research on LLM-based multi-agent systems for code generation. To this end, we conducted a Multi-Vocal Literature Review (MLR), combining insights from both academia and industry, including peer-reviewed studies and grey literature. The aim of this study is to systematically synthesize and analyze existing knowledge on LLM-based multi-agent systems for code generation. Specifically, the review examines the motivations for their use, employed benchmarks and models, key challenges, proposed solutions, and potential directions for future research. We selected and reviewed 114 studies, and the key findings are: 1) the identified reasons for adopting multi-agent systems for code generation were classified into nine categories; 2) the models and evaluation benchmarks utilized across the studies were systematically analyzed to provide a structured overview of commonly adopted LLM configurations and assessment practices; 3) the reported challenges and corresponding solutions were synthesized into six main categories and 26 subcategories; and 4) future research directions were identified and organized into six main categories and 18 subcategories. The results of this MLR will assist researchers and practitioners in pursuing further studies and supporting the real-world adoption of multi-agent systems in industrial settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a multi-vocal literature review (MLR) of 114 studies (peer-reviewed and grey literature) on LLM-based multi-agent systems for code generation. It classifies motivations for adoption into nine categories, provides a structured analysis of employed models and evaluation benchmarks, synthesizes reported challenges and solutions into six main categories with 26 subcategories, and organizes future research directions into six main categories with 18 subcategories.

Significance. If the underlying selection and categorization processes prove robust and reproducible, the review would offer a useful consolidated overview of motivations, common LLM configurations, challenges, and open questions in an emerging sub-area, helping researchers and practitioners navigate the literature and identify gaps for both academic and industrial work.

major comments (2)
  1. [Methods] Methods section: The account of the search strategy, databases queried, search strings, inclusion/exclusion criteria, and quality assessment is missing or only high-level. Without these details the claim of having systematically selected 114 representative studies cannot be verified and the risk of selection bias remains unaddressed.
  2. [Results] Results (categorization subsections): No coding protocol, codebook, double-coding procedure, or inter-rater agreement statistic (e.g., Cohen’s kappa or percentage agreement) is reported for mapping study content onto the nine motivation categories, six challenge categories, or six future-direction categories. This makes the taxonomy counts and distributions difficult to reproduce or audit.
minor comments (1)
  1. [Abstract] Abstract: The time window of the literature search and the final search date are not stated, which would help readers assess currency of the 114-study corpus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of transparency and reproducibility in our multi-vocal literature review. We have prepared point-by-point responses below and will revise the manuscript to address the concerns where feasible.

Point-by-point responses
  1. Referee: [Methods] Methods section: The account of the search strategy, databases queried, search strings, inclusion/exclusion criteria, and quality assessment is missing or only high-level. Without these details the claim of having systematically selected 114 representative studies cannot be verified and the risk of selection bias remains unaddressed.

    Authors: We agree that the Methods section in the submitted manuscript is high-level and insufficient for full verification. In the revised version, we will expand this section to provide a complete account of the search strategy, including the specific academic databases queried (e.g., IEEE Xplore, ACM DL, Scopus), grey literature sources (e.g., arXiv, GitHub repositories, industry white papers), exact search strings, inclusion/exclusion criteria, and the quality assessment process used to select the 114 studies. This will directly address concerns about selection bias and enable reproducibility. revision: yes

  2. Referee: [Results] Results (categorization subsections): No coding protocol, codebook, double-coding procedure, or inter-rater agreement statistic (e.g., Cohen’s kappa or percentage agreement) is reported for mapping study content onto the nine motivation categories, six challenge categories, or six future-direction categories. This makes the taxonomy counts and distributions difficult to reproduce or audit.

    Authors: We acknowledge that the categorization process lacks sufficient methodological detail in the current draft. The nine motivation categories, six challenge categories, and six future-direction categories were derived via iterative thematic analysis involving team discussions to resolve discrepancies. However, we did not implement a formal double-coding protocol with independent raters or compute inter-rater agreement statistics. In the revision, we will add a dedicated subsection describing the coding protocol, codebook development, and assignment process, including illustrative examples of study mappings. We will also note the absence of quantitative agreement metrics as a limitation and discuss how this affects auditability. revision: partial
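The inter-rater statistic the referee requests is simple to compute once a double-coding pass exists. A minimal sketch of Cohen's kappa for two coders assigning studies to motivation categories (the coder labels and category names below are hypothetical, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters assigning one category per item."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal category frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical coders mapping six studies onto motivation categories.
coder1 = ["autonomy", "quality", "autonomy", "scalability", "quality", "quality"]
coder2 = ["autonomy", "quality", "quality", "scalability", "quality", "autonomy"]
print(round(cohens_kappa(coder1, coder2), 3))  # → 0.455
```

Reporting kappa (or simple percentage agreement) alongside the codebook would directly address the auditability concern raised in the report.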

Circularity Check

0 steps flagged

No circularity: standard literature review synthesis

full rationale

This is a multi-vocal literature review that selects 114 external studies and synthesizes their reported motivations, models, challenges, solutions, and future directions into taxonomies. No derivations, equations, parameter fittings, or predictions are performed inside the paper; all content aggregates findings from the cited studies. The nine motivation categories, six challenge categories with 26 subcategories, and six future-direction categories are outputs of the review process applied to external sources rather than self-referential definitions or fitted inputs. No load-bearing step reduces by construction to the paper's own inputs, and the methods section describes a conventional search-and-selection protocol without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The review rests on the standard assumption that a multi-vocal literature review methodology can reliably capture and categorize knowledge from both academic and grey literature sources without introducing new free parameters or entities.

axioms (1)
  • domain assumption Multi-vocal literature review methodology is appropriate and sufficient for synthesizing peer-reviewed and grey literature on LLM-based multi-agent code generation.
    Invoked when the abstract states the approach combines academic and industrial sources to produce the reported categories.

pith-pipeline@v0.9.0 · 5573 in / 1336 out tokens · 47525 ms · 2026-05-15T19:49:56.313359+00:00 · methodology

