CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
Pith reviewed 2026-05-24 03:56 UTC · model grok-4.3
The pith
LLM-based multi-agent systems can automate large-scale software development but only after addressing memory limits, hallucinations, and code smells.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodePori shows that coordinated LLM agents can automate code generation for software tasks. Participant feedback, however, identifies persistent barriers, including memory limitations, hallucinations, and code smells, that prevent reliable large-scale use; successful deployment therefore demands both technical mitigations and a practitioner-centric design perspective.
What carries the argument
CodePori, a multi-agent architecture in which separate LLM agents handle planning, coding, review, and integration steps for end-to-end software development.
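The division of labor described above can be illustrated with a minimal sketch. It is not CodePori's implementation: the names `Artifact`, `planner`, `coder`, `reviewer`, `integrator`, `run_pipeline`, and `fake_llm` are all hypothetical, the prompt prefixes are invented, and the LLM is replaced by a canned stub so the example runs offline. It only shows one plausible way to chain planning, coding, review, and integration roles.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class Artifact:
    """Shared state handed from one agent role to the next."""
    task: str
    plan: List[str] = field(default_factory=list)
    code: str = ""
    review_notes: List[str] = field(default_factory=list)
    integrated: bool = False


def planner(llm: LLM, art: Artifact) -> Artifact:
    # Ask the model for a plan and keep one step per non-empty line.
    reply = llm(f"PLAN: {art.task}")
    art.plan = [s.strip() for s in reply.splitlines() if s.strip()]
    return art


def coder(llm: LLM, art: Artifact) -> Artifact:
    # Generate code from the joined plan steps.
    art.code = llm("CODE: " + "; ".join(art.plan))
    return art


def reviewer(llm: LLM, art: Artifact) -> Artifact:
    # Assumed convention: the reply "OK" means no findings.
    reply = llm("REVIEW: " + art.code)
    art.review_notes = [] if reply.strip() == "OK" else [reply]
    return art


def integrator(llm: LLM, art: Artifact) -> Artifact:
    # Accept the code only if the last review raised nothing.
    # (llm is unused here; kept so all roles share one signature.)
    art.integrated = not art.review_notes
    return art


def run_pipeline(llm: LLM, task: str, max_rounds: int = 2) -> Artifact:
    art = planner(llm, Artifact(task=task))
    for _ in range(max_rounds):
        art = reviewer(llm, coder(llm, art))
        if not art.review_notes:
            break  # review passed, stop regenerating
    return integrator(llm, art)


def fake_llm(prompt: str) -> str:
    # Canned responses so the sketch is deterministic.
    if prompt.startswith("PLAN:"):
        return "define add function\nreturn the sum"
    if prompt.startswith("CODE:"):
        return "def add(a, b):\n    return a + b\n"
    if prompt.startswith("REVIEW:"):
        return "OK"
    return ""


result = run_pipeline(fake_llm, "add two numbers")
```

In this sketch the bounded code-review loop (`max_rounds`) is a design assumption: it keeps a reviewer that never returns "OK" from stalling the pipeline, at the cost of possibly ending with unintegrated code.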
If this is right
- Fixing memory limits and hallucinations in multi-agent LLM systems would allow their use on larger, more complex software projects.
- Mitigating code smells in generated output would improve maintainability of autonomously produced codebases.
- Moving from benchmark pass/fail scores to practitioner evaluations reveals integration barriers that technical metrics alone miss.
- Designing such systems with practitioner input increases the chance that automation tools fit actual development workflows.
Where Pith is reading between the lines
- The same multi-agent coordination pattern could be tested on non-software tasks such as hardware design or data pipeline construction.
- Scaling the system to teams of more than a few agents might introduce coordination overhead not captured in the current participant study.
- If memory and hallucination fixes are found, CodePori-style systems could shorten development cycles in startups that lack large engineering staffs.
Load-bearing premise
Participant feedback from the evaluation accurately reflects the practical performance and limitations of the CodePori system in real-world autonomous software development tasks.
What would settle it
A controlled industry trial in which teams use CodePori on production projects for several weeks and report no notable memory issues, hallucinations, or code smells would falsify the claim that these challenges must be addressed for successful integration.
Original abstract
Context: LLM-based multi-agent systems enable automation and decision support in software development, yet existing studies rely on benchmark datasets offering only binary pass-or-fail results, limiting insight into real-world applicability. Objective: This study empirically investigates the potential and limitations of LLM-based agents in autonomous software development tasks. Method: A two-phase approach was employed: developing a multi-agent system, CodePori, for automated code generation, and conducting participant-based evaluation to assess practical performance. Results: Participant feedback reveals key strengths, challenges, and areas for improvement in LLM-based multi-agent systems, highlighting aspects missed by standard code-generation benchmarks. Conclusions: While LLM-based multi-agent systems show potential for large-scale software development, successful integration requires addressing challenges such as memory limitations, hallucinations, and code smells, alongside a practitioner-centric perspective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodePori, a multi-agent LLM-based system for autonomous software development. It employs a two-phase method consisting of system development followed by participant-based evaluation, and reports that feedback highlights strengths alongside challenges such as memory limitations, hallucinations, and code smells that are not captured by standard binary benchmarks. The central claim is that LLM multi-agent systems have potential for large-scale development but require practitioner-centric improvements to address these issues.
Significance. If the participant evaluation protocol and results are rigorously documented and analyzed, the work could usefully extend beyond pass/fail benchmarks by surfacing practical limitations of current LLM agents. The absence of any quantitative metrics, participant counts, task descriptions, or statistical analysis in the provided description, however, leaves the empirical contribution difficult to evaluate and limits the strength of the conclusions.
major comments (1)
- Abstract and Results: The participant-based evaluation is described only at a high level with no information on the number of participants, the software development tasks used, any quantitative performance metrics collected, or the qualitative analysis procedure. Without these details the feedback-derived claims about strengths, challenges, and required improvements cannot be assessed for reliability or generalizability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We agree that the participant evaluation requires substantially more detail to support the claims and will revise the manuscript to address this.
- Referee: Abstract and Results: The participant-based evaluation is described only at a high level with no information on the number of participants, the software development tasks used, any quantitative performance metrics collected, or the qualitative analysis procedure. Without these details the feedback-derived claims about strengths, challenges, and required improvements cannot be assessed for reliability or generalizability.
  Authors: We agree that the current description is insufficient. In the revised manuscript we will expand the Method and Results sections to report the exact number of participants, the specific software development tasks assigned, any quantitative metrics collected during the evaluation, and the qualitative analysis procedure (including how themes were derived from feedback). These additions will allow readers to assess reliability and generalizability directly.
  Revision: yes
Circularity Check
No significant circularity
The paper describes an empirical investigation: development of the CodePori multi-agent system followed by participant-based evaluation whose conclusions rest on external feedback about strengths, challenges, and limitations. No derivation chain, equations, fitted parameters presented as predictions, or self-citation load-bearing premises appear in the abstract or described method. The central claims are grounded in practitioner input rather than reducing to self-referential definitions or inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Participant-based evaluation can surface practical limitations of LLM agents that binary benchmarks miss.
Forward citations
Cited by 4 Pith papers
- Memory in the Age of AI Agents
  The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
- Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects
  AI IDEs with structured guidance can produce functional large-scale code but frequently introduce design flaws such as duplication, complexity, and principle violations that risk long-term maintainability.
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
  A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
- Large Language Model-Based Agents for Software Engineering: A Survey
  A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.