arxiv: 2402.01411 · v3 · pith:WYRWS3IHnew · submitted 2024-02-02 · 💻 cs.SE

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

Pith reviewed 2026-05-24 03:56 UTC · model grok-4.3

classification 💻 cs.SE
keywords multi-agent systems · large language models · autonomous software development · code generation · LLM agents · software engineering · empirical evaluation · participant study

The pith

LLM-based multi-agent systems can automate large-scale software development but only after addressing memory limits, hallucinations, and code smells.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds CodePori, a multi-agent system that uses large language models to generate code and support autonomous software development tasks. It tests the system through participant evaluations rather than binary benchmarks to surface real-world performance issues. The central finding is that these systems hold promise for large projects yet fall short without fixes for specific problems like memory constraints and incorrect outputs. A practitioner-focused lens is required to move beyond technical demos toward usable tools.

Core claim

CodePori shows that coordinated LLM agents can perform automated code generation for software tasks, yet participant feedback identifies persistent barriers including memory limitations, hallucinations, and code smells that prevent reliable large-scale use; successful deployment therefore demands both technical mitigations and a practitioner-centric design perspective.

What carries the argument

CodePori, a multi-agent architecture in which separate LLM agents handle planning, coding, review, and integration steps for end-to-end software development.
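A minimal sketch of how such a role-separated pipeline could be wired together. This is not CodePori's implementation: the `Agent` class, `run_pipeline`, the instruction strings, and the `echo_llm` stand-in are all hypothetical, and the four roles are simply the ones named in the summary above. A real system would replace `echo_llm` with calls to an actual model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical model interface: a prompt goes in, generated text comes out.
LLM = Callable[[str], str]


@dataclass
class Agent:
    role: str
    instructions: str
    llm: LLM

    def run(self, payload: str) -> str:
        # Each agent sees only its own instructions plus the previous agent's output.
        prompt = f"[{self.role}] {self.instructions}\n\n{payload}"
        return self.llm(prompt)


def run_pipeline(task: str, agents: List[Agent]) -> List[Tuple[str, str]]:
    """Pass the task through the agents in order and record each hand-off."""
    trace = []
    payload = task
    for agent in agents:
        payload = agent.run(payload)
        trace.append((agent.role, payload))
    return trace


def echo_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): echoes the prompt header.
    return "output of " + prompt.splitlines()[0]


agents = [
    Agent("planner", "Break the task into modules.", echo_llm),
    Agent("coder", "Write code for each module.", echo_llm),
    Agent("reviewer", "Review the code and list defects.", echo_llm),
    Agent("integrator", "Merge the reviewed modules.", echo_llm),
]
trace = run_pipeline("Build a to-do list CLI", agents)
for role, output in trace:
    print(role, "->", output)
```

Keeping the hand-offs in a trace makes it possible to inspect where an incorrect output first appeared, which is one way the hallucination and code-smell issues raised above could be localized to a single agent.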

If this is right

  • Fixing memory limits and hallucinations in multi-agent LLM systems would allow their use on larger, more complex software projects.
  • Mitigating code smells in generated output would improve maintainability of autonomously produced codebases.
  • Moving from benchmark pass/fail scores to practitioner evaluations reveals integration barriers that technical metrics alone miss.
  • Designing such systems with practitioner input increases the chance that automation tools fit actual development workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-agent coordination pattern could be tested on non-software tasks such as hardware design or data pipeline construction.
  • Scaling the system to teams of more than a few agents might introduce coordination overhead not captured in the current participant study.
  • If memory and hallucination fixes are found, CodePori-style systems could shorten development cycles in startups that lack large engineering staffs.

Load-bearing premise

Participant feedback from the evaluation accurately reflects the practical performance and limitations of the CodePori system in real-world autonomous software development tasks.

What would settle it

A controlled industry trial in which teams use CodePori on production projects for several weeks and report no notable memory issues, hallucinations, or code smells would falsify the claim that these challenges must be addressed for successful integration.

Figures

Figures reproduced from arXiv: 2402.01411 by Aakash Ahmad, Jussi Rasku, Kai-Kristian Kemell, Malik Abdul Sami, Mika Saari, Muhammad Waseem, Pekka Abrahamsson, Zeeshan Rasheed.

Figure 1: Workflow diagram showcasing an AI-driven multi-agent system for automated code generation.
Figure 2: A numerically stable script for calculating an unbiased estimate of pass@k.
Figure 3: Software developed by CodePori.
3.1. CodePori: LLM-Based Multi-Agent System (RQ1). The implementation of CodePori, an LLM-based multi-agent system, represents a step forward in automating complex software development tasks. In this system, we assigned each agent to handle various aspects of software development, including code development, code review, code verification, and test engineering. This specializa…
Figure 4: HumanEval benchmark results.
… evaluations were structured to provide a comparative analysis against existing systems including MetaGPT [19], ChatDev [21], AlphaCode [22], Incoder [23], CodeGeeX [24], Codex [20] and general domain LLMs such as PaLMCoder [25]. Our findings indicated that CodePori outperformed the existing solutions in code accuracy and efficiency. The experimental results of HumanEval benchmarks …
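Figure 2 is described as a numerically stable pass@k script, but the script itself is not reproduced on this page. As a rough sketch, assuming it follows the unbiased estimator published with HumanEval [20], pass@k = 1 - C(n-c, k) / C(n, k) for n generated samples of which c pass the tests. The version below evaluates it as a running product so that no large binomial coefficients are formed. The function name and the example numbers are illustrative and not taken from the paper.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: budget (k <= n).
    Computes 1 - C(n - c, k) / C(n, k) as a product of small factors.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


# Illustrative numbers only: 10 samples, 3 of which pass.
estimate_k1 = pass_at_k(10, 3, 1)
estimate_k5 = pass_at_k(10, 3, 5)
print(estimate_k1, estimate_k5)
```

For k = 1 the estimate reduces to c / n, which is a quick sanity check on any implementation. Benchmark scores are then the mean of this value over all problems.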
Original abstract

Context: LLM-based multi-agent systems enable automation and decision support in software development, yet existing studies rely on benchmark datasets offering only binary pass-or-fail results, limiting insight into real-world applicability. Objective: This study empirically investigates the potential and limitations of LLM-based agents in autonomous software development tasks. Method: A two-phase approach was employed: developing a multi-agent system, CodePori, for automated code generation, and conducting participant-based evaluation to assess practical performance. Results: Participant feedback reveals key strengths, challenges, and areas for improvement in LLM-based multi-agent systems, highlighting aspects missed by standard code-generation benchmarks. Conclusions: While LLM-based multi-agent systems show potential for large-scale software development, successful integration requires addressing challenges such as memory limitations, hallucinations, and code smells, alongside a practitioner-centric perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces CodePori, a multi-agent LLM-based system for autonomous software development. It employs a two-phase method consisting of system development followed by participant-based evaluation, and reports that feedback highlights strengths alongside challenges such as memory limitations, hallucinations, and code smells that are not captured by standard binary benchmarks. The central claim is that LLM multi-agent systems have potential for large-scale development but require practitioner-centric improvements to address these issues.

Significance. If the participant evaluation protocol and results are rigorously documented and analyzed, the work could usefully extend beyond pass/fail benchmarks by surfacing practical limitations of current LLM agents. The absence of any quantitative metrics, participant counts, task descriptions, or statistical analysis in the provided description, however, leaves the empirical contribution difficult to evaluate and limits the strength of the conclusions.

major comments (1)
  1. Abstract and Results: The participant-based evaluation is described only at a high level with no information on the number of participants, the software development tasks used, any quantitative performance metrics collected, or the qualitative analysis procedure. Without these details the feedback-derived claims about strengths, challenges, and required improvements cannot be assessed for reliability or generalizability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We agree that the participant evaluation requires substantially more detail to support the claims and will revise the manuscript to address this.

Point-by-point responses
  1. Referee: Abstract and Results: The participant-based evaluation is described only at a high level with no information on the number of participants, the software development tasks used, any quantitative performance metrics collected, or the qualitative analysis procedure. Without these details the feedback-derived claims about strengths, challenges, and required improvements cannot be assessed for reliability or generalizability.

    Authors: We agree that the current description is insufficient. In the revised manuscript we will expand the Method and Results sections to report the exact number of participants, the specific software development tasks assigned, any quantitative metrics collected during the evaluation, and the qualitative analysis procedure (including how themes were derived from feedback). These additions will allow readers to assess reliability and generalizability directly.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical investigation: development of the CodePori multi-agent system followed by participant-based evaluation whose conclusions rest on external feedback about strengths, challenges, and limitations. No derivation chain, equations, fitted parameters presented as predictions, or self-citation load-bearing premises appear in the abstract or described method. The central claims are grounded in practitioner input rather than reducing to self-referential definitions or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard empirical assumptions in software engineering research rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Participant-based evaluation can surface practical limitations of LLM agents that binary benchmarks miss.
    This premise underpins the claim that the study provides insight into real-world applicability.

pith-pipeline@v0.9.0 · 5694 in / 1164 out tokens · 45319 ms · 2026-05-24T03:56:48.276039+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Memory in the Age of AI Agents

    cs.CL 2025-12 unverdicted novelty 6.0

    The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

  2. Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

    cs.SE 2026-04 conditional novelty 5.0

    AI IDEs with structured guidance can produce functional large-scale code but frequently introduce design flaws such as duplication, complexity, and principle violations that risk long-term maintainability.

  3. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    cs.AI 2025-08 unverdicted novelty 5.0

    A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

  4. Large Language Model-Based Agents for Software Engineering: A Survey

    cs.SE 2024-09 unverdicted novelty 4.0

    A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 4 Pith papers · 20 internal anchors

  1. [1]

    C. Treude, Navigating complexity in software engineering: A prototype for comparing gpt-n solutions, in: 2023 IEEE/ACM 5th International Workshop on Bots in Software Engineering (BotSE), IEEE, 2023, pp. 1–5

  2. [2]

    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training

  3. [3]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9

  4. [4]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730–27744

  5. [5]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901

  6. [6]

    L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374

  7. [7]

    Learning to Represent Programs with Graphs

    M. Allamanis, M. Brockschmidt, M. Khademi, Learning to represent programs with graphs, arXiv preprint arXiv:1711.00740

  8. [8]

    J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682

  9. [9]

    F. Urrutia, R. Araya, Who’s the best detective? large language models vs. traditional machine learning in detecting incoherent fourth grade math answers, Journal of Educational Computing Research 61 (8) (2024) 187–218

  10. [10]

    X. Hu, H. K. Dam, Future of software engineering @ icse 2023, in: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), IEEE, 2023, pp. 1–3

  11. [11]

    X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, arXiv preprint arXiv:2308.10620

  12. [12]

    Y. Chae, T. Davidson, Large language models for text classification: From zero-shot learning to fine-tuning, Open Science Foundation

  13. [13]

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis & insights from training gopher, arXiv preprint arXiv:2112.11446

  14. [14]

    Z. Rasheed, M. Waseem, K.-K. Kemell, W. Xiaofeng, A. N. Duc, K. Systä, P. Abrahamsson, Autonomous agents in software development: A vision paper, arXiv preprint arXiv:2311.18440

  15. [15]

    X. Gu, H. Zhang, S. Kim, Deep code search, in: Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 933–944

  16. [16]

    F. Lin, D. J. Kim, et al., When llm-based code generation meets the software development process, arXiv preprint arXiv:2403.15852

  17. [17]

    Q. Gu, Llm-based code generation method for golang compiler testing, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 2201–2203

  18. [18]

    Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization

    Y. Ishibashi, Y. Nishimura, Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization, arXiv preprint arXiv:2404.02183

  19. [19]

    S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352

  20. [20]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374

  21. [21]

    C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, M. Sun, Communicative agents for software development, arXiv preprint arXiv:2307.07924

  22. [22]

    Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level code generation with alphacode, Science 378 (6624) (2022) 1092–1097

  23. [23]

    InCoder: A Generative Model for Code Infilling and Synthesis

    D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, M. Lewis, Incoder: A generative model for code infilling and synthesis, arXiv preprint arXiv:2204.05999

  24. [24]

    Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, et al., Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, arXiv preprint arXiv:2303.17568

  25. [25]

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, Journal of Machine Learning Research 24 (240) (2023) 1–113

  26. [26]

    CodePori: Large Scale System for Autonomous Software Development by Using Multi-Agents

    Z. Rasheed, Dataset of the Paper “CodePori: Large Scale System for Autonomous Software Development by Using Multi-Agents”, https://doi.org/10.5281/zenodo.13755415 (2024)

  27. [27]

    Z. Rasheed, M. A. Sami, P. Abrahamsson, Codepori, accessed: 2024-09-12 (2024). URL https://github.com/GPT-Laboratory/CodePori

  28. [28]

    R. Gozalo-Brizuela, E. C. Garrido-Merchan, Chatgpt is not all you need. a state of the art review of large generative ai models, arXiv preprint arXiv:2301.04655

  29. [29]

    D. Rothman, A. Gulli, Transformers for Natural Language Processing: Build, train, and fine-tune deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, and GPT-3, Packt Publishing Ltd, 2022

  30. [30]

    Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun, et al., Personal llm agents: Insights and survey about the capability, efficiency and security, arXiv preprint arXiv:2401.05459

  31. [31]

    S. Dou, H. Jia, S. Wu, H. Zheng, W. Zhou, M. Wu, M. Chai, J. Fan, C. Huang, Y. Tao, et al., What’s wrong with your code generated by large language models? an extensive study, arXiv preprint arXiv:2407.06153

  32. [32]

    A. Yadav, M. Singh, Boldly going where no benchmark has gone before: Exposing bias and shortcomings in code generation evaluation, arXiv preprint arXiv:2401.03855

  33. [33]

    J. Dai, J. Lu, Y. Feng, R. Ruan, M. Cheng, H. Tan, Z. Guo, Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation, arXiv preprint arXiv:2405.11430

  34. [34]

    D. Baidoo-Anu, L. Owusu Ansah, Education in the era of generative artificial intelligence (ai): Understanding the potential benefits of chatgpt in promoting teaching and learning, Available at SSRN 4337484

  35. [35]

    Z. Rasheed, M. Waseem, A. Ahmad, K.-K. Kemell, W. Xiaofeng, A. N. Duc, P. Abrahamsson, Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis, arXiv preprint arXiv:2402.01386

  36. [36]

    Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, L. Sun, A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt, arXiv preprint arXiv:2303.04226

  37. [37]

    P. Hacker, A. Engel, M. Mauer, Regulating chatgpt and other large generative ai models, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023, pp. 1112–1123

  38. [38]

    M. A. Sami, M. Waseem, Z. Rasheed, M. Saari, K. Systä, P. Abrahamsson, Experimenting with multi-agent software development: Towards a unified platform, arXiv preprint arXiv:2406.05381

  39. [39]

    Ö. Aydın, E. Karaarslan, Is chatgpt leading generative ai? what is beyond expectations?, What is beyond expectations

  40. [40]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30

  41. [41]

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223

  42. [42]

    Large Language Models Can Self-Improve

    J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, J. Han, Large language models can self-improve, arXiv preprint arXiv:2210.11610

  43. [43]

    A. M. Sami, Z. Rasheed, K.-K. Kemell, M. Waseem, T. Kilamo, M. Saari, A. N. Duc, K. Systä, P. Abrahamsson, System for systematic literature review using multiple ai agents: Concept and an empirical evaluation, arXiv preprint arXiv:2403.08399

  44. [44]

    Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., Codebert: A pre-trained model for programming and natural languages, arXiv preprint arXiv:2002.08155

  45. [45]

    D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, J. Yin, Unixcoder: Unified cross-modal pre-training for code representation, arXiv preprint arXiv:2203.03850

  46. [46]

    T. Eloundou, S. Manning, P. Mishkin, D. Rock, Gpts are gpts: An early look at the labor market impact potential of large language models, arXiv preprint arXiv:2303.10130

  47. [47]

    Y. Feng, S. Vanam, M. Cherukupally, W. Zheng, M. Qiu, H. Chen, Investigating code generation performance of chat-gpt with crowdsourcing social data, in: Proceedings of the 47th IEEE Computer Software and Applications Conference, 2023, pp. 1–10

  48. [48]

    L. Floridi, M. Chiriatti, Gpt-3: Its nature, scope, limits, and consequences, Minds and Machines 30 (2020) 681–694

  49. [49]

    J. Thiergart, S. Huber, T. Übellacker, Understanding emails and drafting responses–an approach using gpt-3, arXiv preprint arXiv:2102.03062

  50. [50]

    A. Hörnemalm, Chatgpt as a software development tool: The future of development (2023)

  51. [51]

    M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, N. Sundaresan, Unit test case generation with transformers and focal context, arXiv preprint arXiv:2009.05617

  52. [52]

    W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, Y. Liu, The scope of chatgpt in software engineering: A thorough investigation, arXiv preprint arXiv:2305.12138

  53. [53]

    N. Nascimento, P. Alencar, D. Cowan, Comparing software developers with chatgpt: An empirical investigation, arXiv preprint arXiv:2305.11837

  54. [54]

    Z. Rasheed, M. Waseem, K. Systä, P. Abrahamsson, Large language model evaluation via multi ai agents: Preliminary results, arXiv preprint arXiv:2404.01023

  55. [55]

    F. Quin, D. Weyns, M. Galster, C. C. Silva, A/b testing: a systematic literature review, Journal of Systems and Software (2024) 112011

  56. [56]

    Z. Zheng, K. Ning, J. Chen, Y. Wang, W. Chen, L. Guo, W. Wang, Towards an understanding of large language models in software engineering tasks, arXiv preprint arXiv:2308.11396

  57. [57]

    Z. Zheng, K. Ning, Y. Wang, J. Zhang, D. Zheng, M. Ye, J. Chen, A survey of large language models for code: Evolution, benchmarking, and future trends, arXiv preprint arXiv:2311.10372

  58. [58]

    J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, H. Hemmati, Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks, arXiv preprint arXiv:2310.10508

  59. [59]

    Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation, arXiv preprint arXiv:2109.00859

  60. [60]

    S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow

  61. [61]

    B. Wang, A. Komatsuzaki, Gpt-j-6b: A 6 billion parameter autoregressive language model (2021)

  62. [62]

    L. Tunstall, L. Von Werra, T. Wolf, Natural language processing with transformers, O’Reilly Media, Inc., 2022

  63. [63]

    F. F. Xu, U. Alon, G. Neubig, V. J. Hellendoorn, A systematic evaluation of large language models of code, in: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 1–10

  64. [64]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, C. Xiong, Codegen: An open large language model for code with multi-turn program synthesis, arXiv preprint arXiv:2203.13474

  65. [65]

    B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, W. Chen, Codet: Code generation with generated tests, arXiv preprint arXiv:2207.10397

  66. [66]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open-source autoregressive language model, arXiv preprint arXiv:2204.06745

  67. [67]

    S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, C. Re, Ask me anything: A simple strategy for prompting language models, in: The Eleventh International Conference on Learning Representations, 2022

  68. [68]

    C. Wang, Q. Dong, X. Wang, H. Wang, Z. Sui, Statistical dataset evaluation: Reliability, difficulty, and validity, arXiv preprint arXiv:2212.09272

  69. [69]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al., Qwen technical report, arXiv preprint arXiv:2309.16609

  70. [70]

    O. Contributors, Opencompass: A universal evaluation platform for foundation models, GitHub repository

  71. [71]

    S. Golchin, M. Surdeanu, Time travel in llms: Tracing data contamination in large language models, arXiv preprint arXiv:2308.08493

  72. [72]

    M. Riddell, A. Ni, A. Cohan, Quantifying contamination in evaluating code generation capabilities of language models, arXiv preprint arXiv:2403.04811

  73. [73]

    M. Roberts, H. Thakur, C. Herlihy, C. White, S. Dooley, To the cutoff... and beyond? a longitudinal perspective on llm data contamination, in: The Twelfth International Conference on Learning Representations, 2023

  74. [74]

    P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical software engineering 14 (2009) 131–164

  75. [75]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774

  76. [76]

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al., Deepseek-coder: When the large language model meets programming–the rise of code intelligence, arXiv preprint arXiv:2401.14196