Recognition: no theorem link
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
Pith reviewed 2026-05-15 19:49 UTC · model grok-4.3
The pith
A multi-vocal review of 114 academic and industry studies classifies the motivations for multi-agent LLM code generation into nine categories, maps commonly used models and benchmarks, and organizes challenges and future research directions into six main categories each.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through a multi-vocal literature review of 114 studies, the authors establish that motivations for adopting multi-agent LLM systems for code generation fall into nine categories, that the studies employ a recognizable set of models and evaluation benchmarks, that challenges and solutions group into six main categories with 26 subcategories, and that future research directions organize into six main categories with 18 subcategories. The synthesis draws from both academic and industrial sources to support further studies and real-world adoption.
What carries the argument
The multi-vocal literature review (MLR) method, which integrates peer-reviewed papers and grey literature to classify motivations, models, benchmarks, challenges, solutions, and future directions across the selected studies.
If this is right
- The nine motivation categories give practitioners a checklist for deciding whether a multi-agent architecture is appropriate for a given code-generation task.
- The mapped models and benchmarks supply a reference point for choosing LLM configurations and evaluation methods in new work; a sketch of the standard pass@k metric follows this list.
- The six challenge categories with 26 subcategories identify the concrete obstacles that must be solved before reliable industrial use.
- The six categories of future directions with 18 subcategories can be used to prioritize research agendas.
- The overall synthesis supports the transition of multi-agent code generation from research prototypes to production settings.
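On the evaluation side of that reference point: most of the benchmarks catalogued in the review (HumanEval, MBPP, and their descendants; see [22] and [23] in the reference graph below) report the pass@k metric of Chen et al. [22]. Below is a minimal sketch of its unbiased estimator; the sample counts in the usage lines are chosen purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. [22]:
    1 - C(n - c, k) / C(n, k), for n samples of which c pass."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples per task, 37 passing.
print(round(pass_at_k(200, 37, 1), 3))   # 0.185
print(round(pass_at_k(200, 37, 10), 3))  # 0.877
```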
Where Pith is reading between the lines
- If the synthesized challenges are addressed in priority order, the gap between research prototypes and deployable industrial tools may narrow faster than isolated studies suggest.
- Standardization of benchmarks across future papers could make the field more cumulative, building directly on the overview provided here.
- Closer tracking of grey literature in follow-up reviews may reveal whether industry practices are diverging from the academic patterns captured in this study.
- Testing whether the nine motivation categories remain stable as new papers appear would provide a direct measure of the review's lasting utility.
Load-bearing premise
The search and selection process captured a representative sample of both peer-reviewed and grey literature without significant bias, and the manual categorization into nine motivation categories, six challenge categories, and six future-direction categories is complete and reproducible.
What would settle it
A replication that retrieves a substantial set of additional studies missed by the original search and shows that these studies fall outside the nine motivation categories or the six challenge categories.
Original abstract
Large Language Models (LLMs) have enabled multi-agent systems to perform autonomous code generation for complex tasks. Despite the recent growth in research and industrial applications in this area, there is little work on synthesizing evidence from both academic and industrial sources to capture the current state of research on LLM-based multi-agent systems for code generation. To this end, we conducted a Multi-Vocal Literature Review (MLR), combining insights from both academia and industry, including peer-reviewed studies and grey literature. The aim of this study is to systematically synthesize and analyze existing knowledge on LLM-based multi-agent systems for code generation. Specifically, the review examines the motivations for their use, employed benchmarks and models, key challenges, proposed solutions, and potential directions for future research. We selected and reviewed 114 studies, and the key findings are: 1) the identified reasons for adopting multi-agent systems for code generation were classified into nine categories; 2) the models and evaluation benchmarks utilized across the studies were systematically analyzed to provide a structured overview of commonly adopted LLM configurations and assessment practices; 3) the reported challenges and corresponding solutions were synthesized into six main categories and 26 subcategories; and 4) future research directions were identified and organized into six main categories and 18 subcategories. The results of this MLR will assist researchers and practitioners in pursuing further studies and supporting the real-world adoption of multi-agent systems in industrial settings.
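To make the abstract's subject matter concrete, here is a minimal sketch of the iterative plan-code-test loop that recurs across the reviewed systems (compare AgentCoder [6] and MapCoder [7] in the reference graph below). The agent roles, prompt wording, `call_llm` stub, and retry budget are illustrative assumptions, not any surveyed system's actual design.

```python
# Sketch of a planner/coder/tester agent loop; roles and prompts are
# hypothetical, and call_llm must be wired to a real model client.

def call_llm(prompt: str) -> str:
    """Stub for a chat-completion call; replace with a real client."""
    raise NotImplementedError

def planner(task: str) -> str:
    return call_llm(f"Break this coding task into implementation steps:\n{task}")

def coder(task: str, plan: str, feedback: str) -> str:
    return call_llm(
        f"Task: {task}\nPlan: {plan}\n"
        f"Feedback from the last test run (empty on round one): {feedback}\n"
        "Write a complete Python solution."
    )

def tester(code: str, tests: list[str]) -> tuple[bool, str]:
    """Run the candidate against assert-style tests; return (passed, feedback)."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # NOTE: sandbox untrusted code in practice
        for test in tests:
            exec(test, namespace)  # each test is an assert statement
        return True, ""
    except Exception as exc:
        return False, repr(exc)

def generate(task: str, tests: list[str], max_rounds: int = 3) -> str | None:
    plan = planner(task)
    feedback = ""
    for _ in range(max_rounds):
        code = coder(task, plan, feedback)
        passed, feedback = tester(code, tests)
        if passed:
            return code
    return None  # budget exhausted without a passing candidate
```

The feedback edge from the tester back to the coder is what separates this pattern from single-shot prompting.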
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a multi-vocal literature review (MLR) of 114 studies (peer-reviewed and grey literature) on LLM-based multi-agent systems for code generation. It classifies motivations for adoption into nine categories, provides a structured analysis of employed models and evaluation benchmarks, synthesizes reported challenges and solutions into six main categories with 26 subcategories, and organizes future research directions into six main categories with 18 subcategories.
Significance. If the underlying selection and categorization processes prove robust and reproducible, the review would offer a useful consolidated overview of motivations, common LLM configurations, challenges, and open questions in an emerging sub-area, helping researchers and practitioners navigate the literature and identify gaps for both academic and industrial work.
major comments (2)
- [Methods] Methods section: The account of the search strategy, databases queried, search strings, inclusion/exclusion criteria, and quality assessment is missing or only high-level. Without these details the claim of having systematically selected 114 representative studies cannot be verified and the risk of selection bias remains unaddressed.
- [Results] Results (categorization subsections): No coding protocol, codebook, double-coding procedure, or inter-rater agreement statistic (e.g., Cohen’s kappa or percentage agreement) is reported for mapping study content onto the nine motivation categories, six challenge categories, or six future-direction categories. This makes the taxonomy counts and distributions difficult to reproduce or audit.
minor comments (1)
- [Abstract] Abstract: The time window of the literature search and the final search date are not stated, which would help readers assess currency of the 114-study corpus.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects of transparency and reproducibility in our multi-vocal literature review. We have prepared point-by-point responses below and will revise the manuscript to address the concerns where feasible.
Point-by-point responses
Referee: [Methods] Methods section: The account of the search strategy, databases queried, search strings, inclusion/exclusion criteria, and quality assessment is missing or only high-level. Without these details the claim of having systematically selected 114 representative studies cannot be verified and the risk of selection bias remains unaddressed.
Authors: We agree that the Methods section in the submitted manuscript is high-level and insufficient for full verification. In the revised version, we will expand this section to provide a complete account of the search strategy, including the specific academic databases queried (e.g., IEEE Xplore, ACM DL, Scopus), grey literature sources (e.g., arXiv, GitHub repositories, industry white papers), exact search strings, inclusion/exclusion criteria, and the quality assessment process used to select the 114 studies. This will directly address concerns about selection bias and enable reproducibility. revision: yes
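As an illustration of the kind of auditable selection step such a revision could document, here is a minimal screening sketch over retrieved records. The keyword filter, language restriction, cutoff year, and record fields are hypothetical stand-ins, not the authors' actual inclusion/exclusion criteria.

```python
# Hypothetical MLR screening pass; every criterion below is an assumption
# for illustration, not the protocol used to select the 114 studies.
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    abstract: str
    year: int
    venue: str       # empty string for grey literature
    language: str

REQUIRED_TERMS = ("multi-agent", "code generation")  # assumed search terms
MIN_YEAR = 2020                                      # assumed cutoff

def include(rec: Record) -> bool:
    """Apply the (hypothetical) inclusion criteria to one record."""
    text = f"{rec.title} {rec.abstract}".lower()
    return (
        rec.language == "en"
        and rec.year >= MIN_YEAR
        and all(term in text for term in REQUIRED_TERMS)
    )

def screen(records: list[Record]) -> tuple[list[Record], list[Record]]:
    """Split retrieved records into included and excluded sets,
    so the counts at each stage can be reported and audited."""
    included = [r for r in records if include(r)]
    excluded = [r for r in records if not include(r)]
    return included, excluded
```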
Referee: [Results] Results (categorization subsections): No coding protocol, codebook, double-coding procedure, or inter-rater agreement statistic (e.g., Cohen’s kappa or percentage agreement) is reported for mapping study content onto the nine motivation categories, six challenge categories, or six future-direction categories. This makes the taxonomy counts and distributions difficult to reproduce or audit.
Authors: We acknowledge that the categorization process lacks sufficient methodological detail in the current draft. The nine motivation categories, six challenge categories, and six future-direction categories were derived via iterative thematic analysis involving team discussions to resolve discrepancies. However, we did not implement a formal double-coding protocol with independent raters or compute inter-rater agreement statistics. In the revision, we will add a dedicated subsection describing the coding protocol, codebook development, and assignment process, including illustrative examples of study mappings. We will also note the absence of quantitative agreement metrics as a limitation and discuss how this affects auditability. revision: partial
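For concreteness, here is a minimal sketch of the two agreement statistics the referee requests, computed over hypothetical double-coded motivation labels; the category names and assignments are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa (p_o - p_e) / (1 - p_e) for two raters who each
    assign one category per study."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)  # chance agreement
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical double-coding of six studies into motivation categories.
a = ["modularity", "reliability", "modularity", "scalability", "reliability", "modularity"]
b = ["modularity", "reliability", "scalability", "scalability", "reliability", "modularity"]
percent_agreement = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"percentage agreement: {percent_agreement:.2f}")  # 0.83
print(f"Cohen's kappa:        {cohens_kappa(a, b):.2f}")  # 0.75
```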
Circularity Check
No circularity: standard literature review synthesis
Full rationale
This is a multi-vocal literature review that selects 114 external studies and synthesizes their reported motivations, models, challenges, solutions, and future directions into taxonomies. No derivations, equations, parameter fittings, or predictions are performed inside the paper; all content aggregates findings from the cited studies. The nine motivation categories, six challenge categories with 26 subcategories, and six future-direction categories are outputs of the review process applied to external sources rather than self-referential definitions or fitted inputs. No load-bearing step reduces by construction to the paper's own inputs, and the methods section describes a conventional search-and-selection protocol without internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Multi-vocal literature review methodology is appropriate and sufficient for synthesizing peer-reviewed and grey literature on LLM-based multi-agent code generation.
Reference graph
Works this paper leans on
- [1] L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374
- [2]
- [3]
- [4] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., MetaGPT: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352 (2023)
- [5] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, et al., ChatDev: Communicative agents for software development, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 15174–15186
- [6] D. Huang, J. M. Zhang, M. Luck, Q. Bu, Y. Qing, H. Cui, AgentCoder: Multi-agent-based code generation with iterative testing and optimisation, arXiv preprint arXiv:2312.13010 (2023)
- [7] M. A. Islam, M. E. Ali, M. R. Parvez, MapCoder: Multi-agent code generation for competitive problem solving, in: L. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Association for Computational Linguistics, 2024, p...
- [8] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, O. Press, SWE-agent: Agent-computer interfaces enable automated software engineering, Advances in Neural Information Processing Systems 37 (2024) 50528–50652
- [9] M. A. Islam, M. E. Ali, M. R. Parvez, CodeSim: Multi-agent code generation and problem solving through simulation-driven planning and debugging, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29-May 4, 2025, Association for Computational Linguistics, 2025, pp. 5113–...
- [10] J. He, C. Treude, D. Lo, LLM-based multi-agent systems for software engineering: Literature review, vision, and the road ahead, ACM Transactions on Software Engineering and Methodology 34 (5) (2025) 1–30
- [11] M. Mohammadi, Y. Li, J. Lo, W. Yip, Evaluation and benchmarking of LLM agents: A survey, in: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025, pp. 6129–6139
- [12] Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma, Q. Wang, Z. Zheng, Agents in software engineering: Survey, landscape, and vision, Automated Software Engineering 32 (2) (2025) 70
- [13] Z. Rasheed, LLM-based multi-agent systems for code generation: A multi-vocal literature review (Feb. 2026). doi:10.5281/zenodo.18763362. URL https://doi.org/10.5281/zenodo.18763362
- [14] V. Garousi, M. Felderer, M. V. Mäntylä, Guidelines for including grey literature and conducting multivocal literature reviews in software engineering, Information and Software Technology 106 (2019) 101–121
- [15] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic literature reviews in software engineering – a systematic literature review, Information and Software Technology 51 (1) (2009) 7–15
- [16] C. Schardt, M. B. Adams, T. Owens, S. Keitz, P. Fontelo, Utilization of the PICO framework to improve searching PubMed for clinical questions, BMC Medical Informatics and Decision Making 7 (1) (2007) 16
- [17] P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, M. Khalil, Lessons from applying the systematic literature review process within the software engineering domain, Journal of Systems and Software 80 (4) (2007) 571–583
- [18] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, 2014, pp. 1–10
- [19] T. Dybå, T. Dingsøyr, Empirical studies of agile software development: A systematic review, Information and Software Technology 50 (9-10) (2008) 833–859
- [20]
- [21]
- [22] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021)
- [23] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al., Program synthesis with large language models, arXiv preprint arXiv:2108.07732 (2021)
- [24]
- [25]
- [26] B. Szalontai, B. Márton, B. Pintér, T. Gregorics, Investigating reproducibility challenges in LLM bugfixing on the HumanEvalFix benchmark, Software 4 (3) (2025) 17
- [27] J. Liu, C. S. Xia, Y. Wang, L. Zhang, Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation, arXiv preprint arXiv:2305.01210 (2023)
- [28]
- [29]
- [30]
- [31] T. Helmuth, P. Kelly, PSB2: the second program synthesis benchmark suite, in: Proceedings of the Genetic and Evolutionary Computation Conference, 2021, pp. 785–794
- [32] Z. Wang, S. Zhou, D. Fried, G. Neubig, Execution-based evaluation for open-domain code generation, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 1271–1290
- [33]
- [34] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al., MultiPL-E: A scalable and extensible approach to benchmarking neural code generation, arXiv preprint arXiv:2208.08227 (2022)
- [35] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, K. Narasimhan, SWE-bench: Can language models resolve real-world GitHub issues?, arXiv preprint arXiv:2310.06770 (2023)
- [36] G. Kio, SWE-Bench-Secret: Automating AI agent evaluation for software engineering tasks (2025)
- [37] Y. Ding, Z. Wang, W. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, et al., CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion, Advances in Neural Information Processing Systems 36 (2023) 46701–46723
- [38] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, et al., Measuring coding challenge competence with APPS, arXiv preprint arXiv:2105.09938 (2021)
- [39]
- [40] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, I. Stoica, LiveCodeBench: Holistic and contamination free evaluation of large language models for code, arXiv preprint arXiv:2403.07974 (2024)
- [41] P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig, Learning to mine aligned code and natural language pairs from Stack Overflow, in: Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 476–486
- [42] D. Rodriguez-Cardenas, D. N. Palacio, D. Khati, H. Burke, D. Poshyvanyk, Benchmarking causal study to interpret large language models for source code, in: 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2023, pp. 329–334
- [43]
- [44] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al., AgentBench: Evaluating LLMs as agents, arXiv preprint arXiv:2308.03688 (2023)
- [45]
- [46] Z. Z. Wang, A. Asai, F. F. Xu, Y. Xie, G. Neubig, D. Fried, et al., CodeRAG-Bench: Can retrieval augment code generation?, in: Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 3199–3214
- [47] E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, F. Tramèr, AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents, Advances in Neural Information Processing Systems 37 (2024) 82895–82920
- [48] C. Guo, X. Liu, C. Xie, A. Zhou, Y. Zeng, Z. Lin, D. Song, B. Li, RedCode: Risky code execution and generation benchmark for code agents, Advances in Neural Information Processing Systems 37 (2024) 106190–106236
- [49] C. Tony, M. Mutas, N. E. D. Ferreyra, R. Scandariato, LLMSecEval: A dataset of natural language prompts for security evaluations, in: 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), IEEE, 2023, pp. 588–592
- [50] M. Liu, N. Pinckney, B. Khailany, H. Ren, VerilogEval: Evaluating large language models for Verilog code generation, in: 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), IEEE, 2023, pp. 1–8
- [51] N. Pinckney, C. Batten, M. Liu, H. Ren, B. Khailany, Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation, ACM Transactions on Design Automation of Electronic Systems (2025)
- [52] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, Q. Wang, T. Xie, CoderEval: A benchmark of pragmatic code generation with generative pre-trained models, in: Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 1–12
- [53]
- [54]
- [55] Y. Talebirad, A. Nadiri, Multi-agent collaboration: Harnessing the power of intelligent LLM agents, arXiv preprint arXiv:2306.03314 (2023)
- [56]
- [57] D. G. Paul, H. Zhu, I. Bayley, Benchmarks and metrics for evaluations of code generation: A critical review, in: 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), IEEE, 2024, pp. 87–94
- [58]
- [59] A. Kulkarni, M. Chakraborty, Blue sky: Reducing performance gap between commercial and open-source LLMs, in: Proceedings of the 2025 SIAM International Conference on Data Mining (SDM), SIAM, 2025, pp. 335–338
- [60]
- [61] A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, M. Shmueli-Scheuer, Survey on evaluation of LLM-based agents, arXiv preprint arXiv:2503.16416 (2025)
- [62] G. Liang, Q. Tong, LLM-powered AI agent systems and their applications in industry, arXiv preprint arXiv:2505.16120 (2025)
- [63]
- [64]
- [65] H. Elhashemy, Y. Lotfy, Y. Tang, Bridging the prototype-production gap: A multi-agent system for notebooks transformation, in: 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), IEEE, 2025, pp. 299–302
- [66] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical Software Engineering 14 (2009) 131–164
- [67] E. Blair, A reflexive exploration of two qualitative data coding techniques, Journal of Methods and Measurement in the Social Sciences 6 (1) (2015) 14–29
- [68] S. Keele, et al., Guidelines for performing systematic literature reviews in software engineering (2007)
- [69] F. Quin, D. Weyns, M. Galster, C. C. Silva, A/B testing: a systematic literature review, Journal of Systems and Software (2024) 112011
- [70]
- [71] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (3) (2023) 4–8
- [72] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, in: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), IEEE, 2023, pp. 31–53
- [73] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, ACM Transactions on Software Engineering and Methodology (2023)
- [74]
- [75] R. A. Husein, H. Aburajouh, C. Catal, Large language models for code completion: A systematic literature review, Computer Standards & Interfaces 92 (2025) 103917
- [76] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, Q. Wang, Software testing with large language models: Survey, landscape, and vision, IEEE Transactions on Software Engineering 50 (4) (2024) 911–936
- [77]
- [78] J. Shi, Z. Yang, D. Lo, Efficient and green large language models for software engineering: Literature review, vision, and the road ahead, ACM Transactions on Software Engineering and Methodology 34 (5) (2025) 1–22
- [79] M. K. Görmez, M. Yılmaz, P. M. Clarke, Large language models for software engineering: A systematic mapping study, in: European Conference on Software Process Improvement, Springer, 2024, pp. 64–79
- [80] B. V. L. d. Albuquerque, A. F. S. d. Cunha, L. Souza, S. W. M. Siqueira, R. P. d. Santos, Generating and reviewing programming codes with large language models: A systematic mapping study, in: Proceedings of the 20th Brazilian Symposium on Information Systems, 2024, pp. 1–10