CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

Aakash Ahmad; Jussi Rasku; Kai-Kristian Kemell; Malik Abdul Sami; Mika Saari; Muhammad Waseem; Pekka Abrahamsson; Zeeshan Rasheed

arxiv: 2402.01411 · v3 · pith:WYRWS3IHnew · submitted 2024-02-02 · 💻 cs.SE

Zeeshan Rasheed, Muhammad Waseem, Kai-Kristian Kemell, Aakash Ahmad, Malik Abdul Sami, Mika Saari, Jussi Rasku, Pekka Abrahamsson

Pith reviewed 2026-05-24 03:56 UTC · model grok-4.3

classification 💻 cs.SE

keywords multi-agent systems · large language models · autonomous software development · code generation · LLM agents · software engineering · empirical evaluation · participant study

The pith

LLM-based multi-agent systems can automate large-scale software development but only after addressing memory limits, hallucinations, and code smells.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds CodePori, a multi-agent system that uses large language models to generate code and support autonomous software development tasks. It tests the system through participant evaluations rather than binary benchmarks to surface real-world performance issues. The central finding is that these systems hold promise for large projects yet fall short without fixes for specific problems like memory constraints and incorrect outputs. A practitioner-focused lens is required to move beyond technical demos toward usable tools.

Core claim

CodePori shows that coordinated LLM agents can perform automated code generation for software tasks, yet participant feedback identifies persistent barriers including memory limitations, hallucinations, and code smells that prevent reliable large-scale use; successful deployment therefore demands both technical mitigations and a practitioner-centric design perspective.

What carries the argument

CodePori, a multi-agent architecture in which separate LLM agents handle planning, coding, review, and integration steps for end-to-end software development.
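
One way to picture this arrangement is a fixed pipeline in which each agent wraps a role-specific instruction around a shared model call and hands its output to the next agent. The sketch below is hypothetical: the role names follow the description above, while `Agent`, `run_pipeline`, and the stubbed `fake_llm` function are illustrative stand-ins and not CodePori's actual code or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client; it only echoes its input."""
    return f"<output for: {prompt[:40]}>"


@dataclass
class Agent:
    role: str          # e.g. "planner", "coder", "reviewer", "integrator"
    instruction: str   # role-specific instruction (illustrative wording)
    llm: Callable[[str], str] = fake_llm

    def act(self, task: str, context: str) -> str:
        prompt = f"[{self.role}] {self.instruction}\nTask: {task}\nContext: {context}"
        return self.llm(prompt)


def run_pipeline(task: str, agents: List[Agent]) -> List[Tuple[str, str]]:
    """Pass the task through each agent in order, feeding each the previous output."""
    context = ""
    trace = []
    for agent in agents:
        context = agent.act(task, context)
        trace.append((agent.role, context))
    return trace


agents = [
    Agent("planner", "Break the task into modules."),
    Agent("coder", "Write code for the plan."),
    Agent("reviewer", "Point out defects in the code."),
    Agent("integrator", "Merge the reviewed code into one project."),
]
trace = run_pipeline("Build a to-do list CLI", agents)
```

Swapping `fake_llm` for a real client would be the only change needed to try the same control flow against an actual model; the linear hand-off is an assumption, and a real system might loop between coder and reviewer instead.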

If this is right

Fixing memory limits and hallucinations in multi-agent LLM systems would allow their use on larger, more complex software projects.
Mitigating code smells in generated output would improve maintainability of autonomously produced codebases.
Moving from benchmark pass/fail scores to practitioner evaluations reveals integration barriers that technical metrics alone miss.
Designing such systems with practitioner input increases the chance that automation tools fit actual development workflows.
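
As a hedged illustration of what screening generated code for smells might involve, the sketch below flags two common size-based smells (long functions and long parameter lists) with Python's standard `ast` module. The thresholds and the name `find_smells` are assumptions chosen for the example; the paper summary above does not say how smells were identified.

```python
import ast


def find_smells(source: str, max_lines: int = 30, max_params: int = 5):
    """Return (function_name, smell) pairs for simple size-based smells."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Function length in source lines, inclusive of the def line.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                smells.append((node.name, "long function"))
            # Count positional-only, regular, and keyword-only parameters.
            n_params = (
                len(node.args.posonlyargs)
                + len(node.args.args)
                + len(node.args.kwonlyargs)
            )
            if n_params > max_params:
                smells.append((node.name, "long parameter list"))
    return smells


# Illustrative input: a short function with seven parameters.
sample = "def f(a, b, c, d, e, f, g):\n    return a\n"
report = find_smells(sample)
```

A check like this only catches structural smells; duplication or poor naming would need different analyses.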

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-agent coordination pattern could be tested on non-software tasks such as hardware design or data pipeline construction.
Scaling the system to teams of more than a few agents might introduce coordination overhead not captured in the current participant study.
If memory and hallucination fixes are found, CodePori-style systems could shorten development cycles in startups that lack large engineering staffs.

Load-bearing premise

Participant feedback from the evaluation accurately reflects the practical performance and limitations of the CodePori system in real-world autonomous software development tasks.

What would settle it

A controlled industry trial in which teams use CodePori on production projects for several weeks and report no notable memory issues, hallucinations, or code smells would falsify the claim that these challenges must be addressed for successful integration.

Figures

Figures reproduced from arXiv: 2402.01411 by Aakash Ahmad, Jussi Rasku, Kai-Kristian Kemell, Malik Abdul Sami, Mika Saari, Muhammad Waseem, Pekka Abrahamsson, Zeeshan Rasheed.

**Figure 2.** A numerically stable script for calculating an unbiased estimate of pass@k.

**Figure 3.** Software developed by CodePori. Adjacent text from Section 3.1, “CodePori: LLM-Based Multi-Agent System (RQ1)”: The implementation of CodePori, an LLM-based multi-agent system, represents a step forward in automating complex software development tasks. In this system, we assigned each agent to handle various aspects of software development, including code development, code review, code verification, and test engineering. This specializa…

**Figure 4.** HumanEval benchmark results. Adjacent text: …uations were structured to provide a comparative analysis against existing systems including MetaGPT [19], ChatDev [21], AlphaCode [22], Incoder [23], CodeGeeX [24], Codex [20] and general domain LLMs such as PaLMCoder [25]. Our findings indicated that CodePori outperformed the existing solutions in code accuracy and efficiency. The experimental results of HumanEval benchmarks …
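
The script behind Figure 2 is not reproduced on this page. As a minimal sketch, assuming it follows the estimator defined in the Codex paper [20], pass@k for a problem with n generated samples, of which c pass the tests, is 1 - C(n - c, k) / C(n, k). Writing the ratio of binomial coefficients as a running product avoids very large intermediate values. The function name `pass_at_k` and the example numbers are illustrative and are not results from the paper.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: number of samples drawn in the pass@k metric
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # C(n - c, k) / C(n, k) equals the product of (1 - k / i) for i in n-c+1 .. n.
    fail_prob = 1.0
    for i in range(n - c + 1, n + 1):
        fail_prob *= 1.0 - k / i
    return 1.0 - fail_prob


# Illustrative call: 10 samples, 3 correct, k = 1.
estimate = pass_at_k(10, 3, 1)
```

A benchmark score is then the mean of this per-problem estimate over all problems, which is why it reduces each problem to pass-or-fail outcomes.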

Original abstract

Context: LLM-based multi-agent systems enable automation and decision support in software development, yet existing studies rely on benchmark datasets offering only binary pass-or-fail results, limiting insight into real-world applicability.
Objective: This study empirically investigates the potential and limitations of LLM-based agents in autonomous software development tasks.
Method: A two-phase approach was employed: developing a multi-agent system, CodePori, for automated code generation, and conducting participant-based evaluation to assess practical performance.
Results: Participant feedback reveals key strengths, challenges, and areas for improvement in LLM-based multi-agent systems, highlighting aspects missed by standard code-generation benchmarks.
Conclusions: While LLM-based multi-agent systems show potential for large-scale software development, successful integration requires addressing challenges such as memory limitations, hallucinations, and code smells, alongside a practitioner-centric perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CodePori is a multi-agent LLM system for code generation whose main addition is participant feedback on issues like memory limits and hallucinations, though the abstract leaves the evaluation details unspecified.

The one thing to know about this paper is that it describes CodePori, a multi-agent system for autonomous software development using LLMs, and reports on participant feedback that highlights practical challenges such as memory limitations, hallucinations, and code smells. The paper does a good job of pointing out how existing benchmark datasets only give binary results and don't capture real-world applicability. By conducting a participant-based evaluation, it tries to address that gap and brings in a practitioner-centric view.

What is new is the specific implementation of CodePori and the collection of feedback on its use in software development tasks. This adds a concrete example to the literature on multi-agent LLMs for code generation.

The soft spots are in the evaluation section. The abstract mentions the two-phase approach but provides no details on participant numbers, the tasks assigned, quantitative metrics, or how the analysis was done. This leaves the central claims without visible supporting evidence, which is a problem for judging the soundness. The conclusions seem reasonable based on the described feedback, but without the data, it's hard to say if they are well-supported.

This paper is for researchers in software engineering and AI who are interested in integrating multi-agent systems into development workflows. A reader looking for examples of systems and initial user studies would get some value from it. It deserves a serious referee because the topic is relevant and the approach has potential, even if the current description is high-level. The full paper likely has more on the system architecture and study protocol. I recommend sending it to peer review to allow proper assessment of the methods and results.

Referee Report

1 major / 0 minor

Summary. The paper introduces CodePori, a multi-agent LLM-based system for autonomous software development. It employs a two-phase method consisting of system development followed by participant-based evaluation, and reports that feedback highlights strengths alongside challenges such as memory limitations, hallucinations, and code smells that are not captured by standard binary benchmarks. The central claim is that LLM multi-agent systems have potential for large-scale development but require practitioner-centric improvements to address these issues.

Significance. If the participant evaluation protocol and results are rigorously documented and analyzed, the work could usefully extend beyond pass/fail benchmarks by surfacing practical limitations of current LLM agents. The absence of any quantitative metrics, participant counts, task descriptions, or statistical analysis in the provided description, however, leaves the empirical contribution difficult to evaluate and limits the strength of the conclusions.

major comments (1)

Abstract and Results: The participant-based evaluation is described only at a high level with no information on the number of participants, the software development tasks used, any quantitative performance metrics collected, or the qualitative analysis procedure. Without these details the feedback-derived claims about strengths, challenges, and required improvements cannot be assessed for reliability or generalizability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We agree that the participant evaluation requires substantially more detail to support the claims and will revise the manuscript to address this.

Referee: Abstract and Results: The participant-based evaluation is described only at a high level with no information on the number of participants, the software development tasks used, any quantitative performance metrics collected, or the qualitative analysis procedure. Without these details the feedback-derived claims about strengths, challenges, and required improvements cannot be assessed for reliability or generalizability.

Authors: We agree that the current description is insufficient. In the revised manuscript we will expand the Method and Results sections to report the exact number of participants, the specific software development tasks assigned, any quantitative metrics collected during the evaluation, and the qualitative analysis procedure (including how themes were derived from feedback). These additions will allow readers to assess reliability and generalizability directly.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical investigation: development of the CodePori multi-agent system followed by participant-based evaluation whose conclusions rest on external feedback about strengths, challenges, and limitations. No derivation chain, equations, fitted parameters presented as predictions, or self-citation load-bearing premises appear in the abstract or described method. The central claims are grounded in practitioner input rather than reducing to self-referential definitions or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard empirical assumptions in software engineering research rather than new free parameters or invented entities.

axioms (1)

Domain assumption: Participant-based evaluation can surface practical limitations of LLM agents that binary benchmarks miss.
This premise underpins the claim that the study provides insight into real-world applicability.

pith-pipeline@v0.9.0 · 5694 in / 1164 out tokens · 45319 ms · 2026-05-24T03:56:48.276039+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Memory in the Age of AI Agents
cs.CL · 2025-12 · unverdicted · novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects
cs.SE · 2026-04 · conditional · novelty 5.0

AI IDEs with structured guidance can produce functional large-scale code but frequently introduce design flaws such as duplication, complexity, and principle violations that risk long-term maintainability.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI · 2025-08 · unverdicted · novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

Large Language Model-Based Agents for Software Engineering: A Survey
cs.SE · 2024-09 · unverdicted · novelty 4.0

A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 4 Pith papers · 20 internal anchors

[1] C. Treude, Navigating complexity in software engineering: A prototype for comparing gpt-n solutions, in: 2023 IEEE/ACM 5th International Workshop on Bots in Software Engineering (BotSE), IEEE, 2023, pp. 1–5.
[2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9.
[4] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730–27744.
[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[6] L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374.
[7] M. Allamanis, M. Brockschmidt, M. Khademi, Learning to represent programs with graphs, arXiv preprint arXiv:1711.00740.
[8] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682.
[9] F. Urrutia, R. Araya, Who’s the best detective? large language models vs. traditional machine learning in detecting incoherent fourth grade math answers, Journal of Educational Computing Research 61 (8) (2024) 187–218.
[10] X. Hu, H. K. Dam, Future of software engineering @ icse 2023, in: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), IEEE, 2023, pp. 1–3.
[11] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, arXiv preprint arXiv:2308.10620.
[12] Y. Chae, T. Davidson, Large language models for text classification: From zero-shot learning to fine-tuning, Open Science Foundation.
[13] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis & insights from training gopher, arXiv preprint arXiv:2112.11446.
[14] Z. Rasheed, M. Waseem, K.-K. Kemell, W. Xiaofeng, A. N. Duc, K. Systä, P. Abrahamsson, Autonomous agents in software development: A vision paper, arXiv preprint arXiv:2311.18440.
[15] X. Gu, H. Zhang, S. Kim, Deep code search, in: Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 933–944.
[16] F. Lin, D. J. Kim, et al., When llm-based code generation meets the software development process, arXiv preprint arXiv:2403.15852.
[17] Q. Gu, Llm-based code generation method for golang compiler testing, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 2201–2203.
[18] Y. Ishibashi, Y. Nishimura, Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization, arXiv preprint arXiv:2404.02183.
[19] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352.
[20] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374.
[21] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, M. Sun, Communicative agents for software development, arXiv preprint arXiv:2307.07924.
[22] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level code generation with alphacode, Science 378 (6624) (2022) 1092–1097.
[23] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, M. Lewis, Incoder: A generative model for code infilling and synthesis, arXiv preprint arXiv:2204.05999.
[24] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, et al., Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, arXiv preprint arXiv:2303.17568.
[25] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, Journal of Machine Learning Research 24 (240) (2023) 1–113.
[26] Z. Rasheed, Dataset of the Paper “CodePori: Large Scale System for Autonomous Software Development by Using Multi-Agents”, https://doi.org/10.5281/zenodo.13755415 (2024).
[27] Z. Rasheed, M. A. Sami, P. Abrahamsson, Codepori, accessed: 2024-09-12 (2024). URL https://github.com/GPT-Laboratory/CodePori
[28] R. Gozalo-Brizuela, E. C. Garrido-Merchan, Chatgpt is not all you need. a state of the art review of large generative ai models, arXiv preprint arXiv:2301.04655.
[29] D. Rothman, A. Gulli, Transformers for Natural Language Processing: Build, train, and fine-tune deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, and GPT-3, Packt Publishing Ltd, 2022.
[30] Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun, et al., Personal llm agents: Insights and survey about the capability, efficiency and security, arXiv preprint arXiv:2401.05459.
[31] S. Dou, H. Jia, S. Wu, H. Zheng, W. Zhou, M. Wu, M. Chai, J. Fan, C. Huang, Y. Tao, et al., What’s wrong with your code generated by large language models? an extensive study, arXiv preprint arXiv:2407.06153.
[32] A. Yadav, M. Singh, Boldly going where no benchmark has gone before: Exposing bias and shortcomings in code generation evaluation, arXiv preprint arXiv:2401.03855.
[33] J. Dai, J. Lu, Y. Feng, R. Ruan, M. Cheng, H. Tan, Z. Guo, Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation, arXiv preprint arXiv:2405.11430.
[34] D. Baidoo-Anu, L. Owusu Ansah, Education in the era of generative artificial intelligence (ai): Understanding the potential benefits of chatgpt in promoting teaching and learning, Available at SSRN 4337484.
[35] Z. Rasheed, M. Waseem, A. Ahmad, K.-K. Kemell, W. Xiaofeng, A. N. Duc, P. Abrahamsson, Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis, arXiv preprint arXiv:2402.01386.
[36] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, L. Sun, A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt, arXiv preprint arXiv:2303.04226.
[37] P. Hacker, A. Engel, M. Mauer, Regulating chatgpt and other large generative ai models, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023, pp. 1112–1123.
[38] M. A. Sami, M. Waseem, Z. Rasheed, M. Saari, K. Systä, P. Abrahamsson, Experimenting with multi-agent software development: Towards a unified platform, arXiv preprint arXiv:2406.05381.
[39] Ö. Aydın, E. Karaarslan, Is chatgpt leading generative ai? what is beyond expectations?, What is beyond expectations.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30.
[41] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223.
[42] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, J. Han, Large language models can self-improve, arXiv preprint arXiv:2210.11610.
[43] A. M. Sami, Z. Rasheed, K.-K. Kemell, M. Waseem, T. Kilamo, M. Saari, A. N. Duc, K. Systä, P. Abrahamsson, System for systematic literature review using multiple ai agents: Concept and an empirical evaluation, arXiv preprint arXiv:2403.08399.
[44] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., Codebert: A pre-trained model for programming and natural languages, arXiv preprint arXiv:2002.08155.
[45] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, J. Yin, Unixcoder: Unified cross-modal pre-training for code representation, arXiv preprint arXiv:2203.03850.
[46] T. Eloundou, S. Manning, P. Mishkin, D. Rock, Gpts are gpts: An early look at the labor market impact potential of large language models, arXiv preprint arXiv:2303.10130.
[47] Y. Feng, S. Vanam, M. Cherukupally, W. Zheng, M. Qiu, H. Chen, Investigating code generation performance of chat-gpt with crowdsourcing social data, in: Proceedings of the 47th IEEE Computer Software and Applications Conference, 2023, pp. 1–10.
[48] L. Floridi, M. Chiriatti, Gpt-3: Its nature, scope, limits, and consequences, Minds and Machines 30 (2020) 681–694.
[49] J. Thiergart, S. Huber, T. Übellacker, Understanding emails and drafting responses–an approach using gpt-3, arXiv preprint arXiv:2102.03062.
[50] A. Hörnemalm, Chatgpt as a software development tool: The future of development (2023).
[51] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, N. Sundaresan, Unit test case generation with transformers and focal context, arXiv preprint arXiv:2009.05617.
[52] W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, Y. Liu, The scope of chatgpt in software engineering: A thorough investigation, arXiv preprint arXiv:2305.12138.
[53] N. Nascimento, P. Alencar, D. Cowan, Comparing software developers with chatgpt: An empirical investigation, arXiv preprint arXiv:2305.11837.
[54] Z. Rasheed, M. Waseem, K. Systä, P. Abrahamsson, Large language model evaluation via multi ai agents: Preliminary results, arXiv preprint arXiv:2404.01023.
[55] F. Quin, D. Weyns, M. Galster, C. C. Silva, A/b testing: a systematic literature review, Journal of Systems and Software (2024) 112011.
[56] Z. Zheng, K. Ning, J. Chen, Y. Wang, W. Chen, L. Guo, W. Wang, Towards an understanding of large language models in software engineering tasks, arXiv preprint arXiv:2308.11396.
[57] Z. Zheng, K. Ning, Y. Wang, J. Zhang, D. Zheng, M. Ye, J. Chen, A survey of large language models for code: Evolution, benchmarking, and future trends, arXiv preprint arXiv:2311.10372.
[58] J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, H. Hemmati, Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks, arXiv preprint arXiv:2310.10508.
[59] Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation, arXiv preprint arXiv:2109.00859.
[60] S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, If you use this software, please cite it using these metadata 58.
[61] B. Wang, A. Komatsuzaki, Gpt-j-6b: A 6 billion parameter autoregressive language model (2021).
[62] L. Tunstall, L. Von Werra, T. Wolf, Natural language processing with transformers, O’Reilly Media, Inc., 2022.
[63] F. F. Xu, U. Alon, G. Neubig, V. J. Hellendoorn, A systematic evaluation of large language models of code, in: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 1–10.
[64] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, C. Xiong, Codegen: An open large language model for code with multi-turn program synthesis, arXiv preprint arXiv:2203.13474.
[65] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, W. Chen, Codet: Code generation with generated tests, arXiv preprint arXiv:2207.10397.
[66] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open-source autoregressive language model, arXiv preprint arXiv:2204.06745.
[67] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, C. Re, Ask me anything: A simple strategy for prompting language models, in: The Eleventh International Conference on Learning Representations, 2022.
[68] C. Wang, Q. Dong, X. Wang, H. Wang, Z. Sui, Statistical dataset evaluation: Reliability, difficulty, and validity, arXiv preprint arXiv:2212.09272.
[69] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al., Qwen technical report, arXiv preprint arXiv:2309.16609.
[70] O. Contributors, Opencompass: A universal evaluation platform for foundation models, GitHub repository.
[71] S. Golchin, M. Surdeanu, Time travel in llms: Tracing data contamination in large language models, arXiv preprint arXiv:2308.08493.
[72] M. Riddell, A. Ni, A. Cohan, Quantifying contamination in evaluating code generation capabilities of language models, arXiv preprint arXiv:2403.04811.
[73] M. Roberts, H. Thakur, C. Herlihy, C. White, S. Dooley, To the cutoff... and beyond? a longitudinal perspective on llm data contamination, in: The Twelfth International Conference on Learning Representations, 2023.
[74] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical software engineering 14 (2009) 131–164.
[75] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774.
[76] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al., Deepseek-coder: When the large language model meets programming–the rise of code intelligence, arXiv preprint arXiv:2401.14196.


work page arXiv

[15] [15]

X. Gu, H. Zhang, S. Kim, Deep code search, in: Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 933–944

work page 2018

[16] [16]

F. Lin, D. J. Kim, et al., When llm-based code generation meets the software development process, arXiv preprint arXiv:2403.15852

work page arXiv

[17] [17]

Q. Gu, Llm-based code generation method for golang compiler testing, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 2201–2203

work page 2023

[18] [18]

Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization

Y. Ishibashi, Y. Nishimura, Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization, arXiv preprint arXiv:2404.02183

work page arXiv

[19] [19]

S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, M. Sun, Communicative agents for software development, arXiv preprint arXiv:2307.07924

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level code generation with alphacode, Science 378 (6624) (2022) 1092–1097

work page 2022

[23] [23]

InCoder: A Generative Model for Code Infilling and Synthesis

D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, M. Lewis, Incoder: A generative model for code infilling and synthesis, arXiv preprint arXiv:2204.05999

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Zheng, X

Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, et al., Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, arXiv preprint arXiv:2303.17568. 20

work page arXiv

[25] [25]

Chowdhery, S

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, Journal of Machine Learning Research 24 (240) (2023) 1–113

work page 2023

[26] [26]

CodePori: Large Scale System for Autonomous Software Development by Using Multi-Agents

Z. Rasheed, Dataset of the Paper “CodePori: Large Scale System for Autonomous Software Development by Using Multi-Agents”, https://doi.org/10.5281/zenodo.13755415 (2024)

work page doi:10.5281/zenodo.13755415 2024

[27] [27]

Rasheed, M

Z. Rasheed, M. A. Sami, P. Abrahamsson, Codepori, accessed: 2024-09-12 (2024). URL https://github.com/GPT-Laboratory/CodePori

work page 2024

[28] [28]

Gozalo-Brizuela, E

R. Gozalo-Brizuela, E. C. Garrido-Merchan, Chatgpt is not all you need. a state of the art review of large generative ai models, arXiv preprint arXiv:2301.04655

work page arXiv

[29] [29]

Rothman, A

D. Rothman, A. Gulli, Transformers for Natural Language Processing: Build, train, and fine-tune deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, and GPT-3, Packt Publishing Ltd, 2022

work page 2022

[30] [30]

Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun, et al., Personal llm agents: Insights and survey about the capability, efficiency and security, arXiv preprint arXiv:2401.05459

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

S. Dou, H. Jia, S. Wu, H. Zheng, W. Zhou, M. Wu, M. Chai, J. Fan, C. Huang, Y. Tao, et al., What’s wrong with your code generated by large language models? an extensive study, arXiv preprint arXiv:2407.06153

work page arXiv

[32] [32]

Yadav, M

A. Yadav, M. Singh, Boldly going where no benchmark has gone before: Exposing bias and shortcomings in code generation evaluation, arXiv preprint arXiv:2401.03855

work page arXiv

[33] [33]

J. Dai, J. Lu, Y. Feng, R. Ruan, M. Cheng, H. Tan, Z. Guo, Mhpp: Exploring the capa- bilities and limitations of language models beyond basic code generation, arXiv preprint arXiv:2405.11430

work page arXiv

[34] [34]

Baidoo-Anu, L

D. Baidoo-Anu, L. Owusu Ansah, Education in the era of generative artificial intelligence (ai): Understanding the potential benefits of chatgpt in promoting teaching and learning, Available at SSRN 4337484

work page

[35] [35]

Rasheed, M

Z. Rasheed, M. Waseem, A. Ahmad, K.-K. Kemell, W. Xiaofeng, A. N. Duc, P. Abrahams- son, Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis, arXiv preprint arXiv:2402.01386

work page arXiv

[36] [36]

Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, L. Sun, A comprehensive survey of ai- generated content (aigc): A history of generative ai from gan to chatgpt, arXiv preprint arXiv:2303.04226

work page arXiv

[37] [37]

Hacker, A

P. Hacker, A. Engel, M. Mauer, Regulating chatgpt and other large generative ai models, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Trans- parency, 2023, pp. 1112–1123

work page 2023

[38] [38]

M. A. Sami, M. Waseem, Z. Rasheed, M. Saari, K. Syst¨ a, P. Abrahamsson, Experiment- ing with multi-agent software development: Towards a unified platform, arXiv preprint arXiv:2406.05381

work page arXiv

[39] [39]

Aydın, E

¨O. Aydın, E. Karaarslan, Is chatgpt leading generative ai? what is beyond expectations?, What is beyond expectations

work page

[40] [40]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30. 21

work page

[41] [41]

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Large Language Models Can Self-Improve

J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, J. Han, Large language models can self-improve, arXiv preprint arXiv:2210.11610

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

A. M. Sami, Z. Rasheed, K.-K. Kemell, M. Waseem, T. Kilamo, M. Saari, A. N. Duc, K. Syst¨ a, P. Abrahamsson, System for systematic literature review using multiple ai agents: Concept and an empirical evaluation, arXiv preprint arXiv:2403.08399

work page arXiv

[44] [44]

Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., Codebert: A pre-trained model for programming and natural languages, arXiv preprint arXiv:2002.08155

work page internal anchor Pith review Pith/arXiv arXiv 2002

[45] [45]

D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, J. Yin, Unixcoder: Unified cross-modal pre-training for code representation, arXiv preprint arXiv:2203.03850

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

doi:10.48550/arXiv.2303.10130 , url =

T. Eloundou, S. Manning, P. Mishkin, D. Rock, Gpts are gpts: An early look at the labor market impact potential of large language models, arXiv preprint arXiv:2303.10130

work page arXiv

[47] [47]

Y. Feng, S. Vanam, M. Cherukupally, W. Zheng, M. Qiu, H. Chen, Investigating code generation performance of chat-gpt with crowdsourcing social data, in: Proceedings of the 47th IEEE Computer Software and Applications Conference, 2023, pp. 1–10

work page 2023

[48] [48]

Floridi, M

L. Floridi, M. Chiriatti, Gpt-3: Its nature, scope, limits, and consequences, Minds and Machines 30 (2020) 681–694

work page 2020

[49] [49]

Thiergart, S

J. Thiergart, S. Huber, T. ¨Ubellacker, Understanding emails and drafting responses–an approach using gpt-3, arXiv preprint arXiv:2102.03062

work page arXiv

[50] [50]

H¨ ornemalm, Chatgpt as a software development tool: The future of development (2023)

A. H¨ ornemalm, Chatgpt as a software development tool: The future of development (2023)

work page 2023

[51] [51]

Tufano, D

M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, N. Sundaresan, Unit test case gener- ation with transformers and focal context, arXiv preprint arXiv:2009.05617

work page arXiv 2009

[52] [52]

W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, Y. Liu, The scope of chatgpt in software engineering: A thorough investigation, arXiv preprint arXiv:2305.12138

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

Nascimento, P

N. Nascimento, P. Alencar, D. Cowan, Comparing software developers with chatgpt: An empirical investigation, arXiv preprint arXiv:2305.11837

work page arXiv

[54] [54]

Rasheed, M

Z. Rasheed, M. Waseem, K. Syst¨ a, P. Abrahamsson, Large language model evaluation via multi ai agents: Preliminary results, arXiv preprint arXiv:2404.01023

work page arXiv

[55] [55]

F. Quin, D. Weyns, M. Galster, C. C. Silva, A/b testing: a systematic literature review, Journal of Systems and Software (2024) 112011

work page 2024

[56] [56]

Zheng, K

Z. Zheng, K. Ning, J. Chen, Y. Wang, W. Chen, L. Guo, W. Wang, Towards an understanding of large language models in software engineering tasks, arXiv preprint arXiv:2308.11396

work page arXiv

[57] [57]

Zheng, K

Z. Zheng, K. Ning, Y. Wang, J. Zhang, D. Zheng, M. Ye, J. Chen, A survey of large language models for code: Evolution, benchmarking, and future trends, arXiv preprint arXiv:2311.10372

work page arXiv

[58] [58]

J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, H. Hemmati, Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks, arXiv preprint arXiv:2310.10508. 22

work page arXiv

[59] [59]

Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified pre- trained encoder-decoder models for code understanding and generation, arXiv preprint arXiv:2109.00859

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

Black, L

S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, If you use this software, please cite it using these metadata 58

work page

[61] [61]

B. Wang, A. Komatsuzaki, Gpt-j-6b: A 6 billion parameter autoregressive language model (2021)

work page 2021

[62] [62]

Tunstall, L

L. Tunstall, L. Von Werra, T. Wolf, Natural language processing with transformers, ” O’Reilly Media, Inc. ”, 2022

work page 2022

[63] [63]

F. F. Xu, U. Alon, G. Neubig, V. J. Hellendoorn, A systematic evaluation of large language models of code, in: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 1–10

work page 2022

[64] [64]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, C. Xiong, Codegen: An open large language model for code with multi-turn program synthesis, arXiv preprint arXiv:2203.13474

work page internal anchor Pith review Pith/arXiv arXiv

[65] [65]

B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, W. Chen, Codet: Code generation with generated tests, arXiv preprint arXiv:2207.10397

work page internal anchor Pith review Pith/arXiv arXiv

[66] [66]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open-source autoregressive language model, arXiv preprint arXiv:2204.06745

work page internal anchor Pith review Pith/arXiv arXiv

[67] [67]

Arora, A

S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, C. Re, Ask me anything: A simple strategy for prompting language models, in: The Eleventh Interna- tional Conference on Learning Representations, 2022

work page 2022

[68] [68]

C. Wang, Q. Dong, X. Wang, H. Wang, Z. Sui, Statistical dataset evaluation: Reliability, difficulty, and validity, arXiv preprint arXiv:2212.09272

work page arXiv

[69] [69]

J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al., Qwen technical report, arXiv preprint arXiv:2309.16609

work page internal anchor Pith review Pith/arXiv arXiv

[70] [70]

Contributors, Opencompass: A universal evaluation platform for foundation models, GitHub repository

O. Contributors, Opencompass: A universal evaluation platform for foundation models, GitHub repository

work page

[71] [71]

Golchin, M

S. Golchin, M. Surdeanu, Time travel in llms: Tracing data contamination in large lan- guage models, arXiv preprint arXiv:2308.08493

work page arXiv

[72] [72]

Riddell, A

M. Riddell, A. Ni, A. Cohan, Quantifying contamination in evaluating code generation capabilities of language models, arXiv preprint arXiv:2403.04811

work page arXiv

[73] [73]

Roberts, H

M. Roberts, H. Thakur, C. Herlihy, C. White, S. Dooley, To the cutoff... and beyond? a longitudinal perspective on llm data contamination, in: The Twelfth International Con- ference on Learning Representations, 2023

work page 2023

[74] [74]

Runeson, M

P. Runeson, M. H¨ ost, Guidelines for conducting and reporting case study research in software engineering, Empirical software engineering 14 (2009) 131–164

work page 2009

[75] [75]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv

[76] [76]

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al., Deepseek-coder: When the large language model meets programming–the rise of code intelligence, arXiv preprint arXiv:2401.14196. 23

work page internal anchor Pith review Pith/arXiv arXiv