Guidelines for Empirical Studies in Software Engineering involving Large Language Models
Pith reviewed 2026-05-25 08:15 UTC · model grok-4.3
The pith
A taxonomy of seven study types and eight guidelines address reproducibility threats in empirical software engineering research that uses large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a taxonomy of seven study types that organizes how LLMs are used in SE research together with eight guidelines for designing and reporting such studies. The guidelines require researchers to declare LLM usage and role, report model versions and configurations, document tool architecture, disclose prompts and interaction logs, validate outputs with humans, include an open LLM baseline, use suitable baselines and metrics, and articulate limitations and mitigations. Requirements are distinguished from recommendations and contextualized by study type.
What carries the argument
The taxonomy of seven study types, combined with eight guidelines that distinguish requirements (must) from recommended practices (should) and are mapped to study types via an applicability matrix.
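One way to picture the applicability matrix is as a nested mapping from study type to guideline to obligation level. The sketch below is illustrative only: the guideline wording is paraphrased from the abstract, while the study-type labels (`study_type_1` to `study_type_7`), the `Level` enum, the helper names, and the single `MUST` entry are placeholders, since this page does not name the seven types or give the real assignments.

```python
from enum import Enum


class Level(Enum):
    """Obligation level of a guideline for a study type (assumed encoding)."""
    MUST = "must"
    SHOULD = "should"
    NOT_APPLICABLE = "n/a"


# The eight guideline topics, paraphrased from the paper's abstract.
GUIDELINES = [
    "declare LLM usage and role",
    "report model versions, configurations, and customizations",
    "document the tool architecture beyond the model",
    "disclose prompts, their development, and interaction logs",
    "validate LLM outputs with humans",
    "include an open LLM as a baseline",
    "use suitable baselines, benchmarks, and metrics",
    "articulate limitations and mitigations",
]

# Placeholder labels: the source states there are seven study types
# but does not name them here.
STUDY_TYPES = [f"study_type_{i}" for i in range(1, 8)]


def build_matrix(default=Level.SHOULD):
    """Return {study_type: {guideline: Level}} filled with a default level."""
    return {st: {g: default for g in GUIDELINES} for st in STUDY_TYPES}


def requirements_for(matrix, study_type, level=Level.MUST):
    """List the guidelines that apply at the given level to one study type."""
    return [g for g, lv in matrix[study_type].items() if lv is level]


matrix = build_matrix()
# Illustrative entry only; the real assignments are in the paper's matrix.
matrix["study_type_1"][GUIDELINES[0]] = Level.MUST
print(requirements_for(matrix, "study_type_1"))
```

A reviewer-facing checklist could then be derived per study type by listing the `MUST` entries first and the `SHOULD` entries second.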
If this is right
- Authors following the guidelines will explicitly declare the role of any LLM and report exact model versions plus customizations.
- Prompts, their iterative development process, and full interaction logs will be disclosed for every study that uses an LLM.
- Every such study will include validation of LLM outputs by humans and at least one open LLM as a baseline comparison.
- Limitations arising from non-determinism or model changes will be stated together with explicit mitigation steps.
- Reviewers will have a checklist that ties each guideline to the relevant study type to assess reporting completeness.
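The reporting items above (model version, configuration, prompts, and run-to-run variation) could be captured in a small machine-readable record. The following is a minimal sketch under stated assumptions: the class `LLMRunRecord`, its field names, the example model name and version, and the `modal_agreement` measure are all hypothetical and are not prescribed by the paper.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LLMRunRecord:
    """Hypothetical record of one prompt sent repeatedly to one model."""
    model_name: str
    model_version: str
    temperature: float
    seed: Optional[int]
    prompt: str
    outputs: List[str] = field(default_factory=list)

    def prompt_sha256(self) -> str:
        """Stable fingerprint of the exact prompt text."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        """Serialize the record, including the prompt fingerprint."""
        data = asdict(self)
        data["prompt_sha256"] = self.prompt_sha256()
        return json.dumps(data, indent=2, sort_keys=True)


def modal_agreement(outputs: List[str]) -> float:
    """Share of runs whose output equals the most frequent output."""
    if not outputs:
        raise ValueError("need at least one output")
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)


# Made-up example values for illustration.
record = LLMRunRecord(
    model_name="example-model",
    model_version="2025-01-01",
    temperature=0.0,
    seed=42,
    prompt="Classify this commit message as bugfix or feature.",
    outputs=["bugfix", "bugfix", "feature", "bugfix"],
)
print(record.to_json())
print(modal_agreement(record.outputs))  # 3 of 4 runs agree -> 0.75
```

Exact string equality is a deliberately strict notion of agreement; for free-text outputs a study would likely need a task-specific comparison instead.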
Where Pith is reading between the lines
- The same taxonomy and guideline structure could be adapted to empirical studies outside software engineering that rely on LLMs for data generation or analysis.
- If the guidelines are widely adopted, future meta-analyses could measure whether adherence correlates with higher replication success rates across published SE papers.
- The living online resource could serve as a testbed for community-driven updates when new LLM capabilities or failure modes emerge.
Load-bearing premise
The identified threats to reproducibility are the main problems and can be sufficiently mitigated by following the proposed guidelines without separate validation of the guidelines' actual impact on reproducibility rates.
What would settle it
A controlled comparison in which multiple independent teams re-execute the same set of LLM-based SE studies, once following the new guidelines and once without them, and test whether the resulting reproducibility rates differ.
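The comparison described here could be analyzed, in its simplest form, as a difference between two proportions. The sketch below uses a pooled two-proportion z-test with a normal approximation; the function names and the counts (18 of 20 versus 10 of 20 successful re-executions) are invented for illustration and do not come from the paper, and for small samples an exact test would be a more cautious choice.

```python
import math


def reproducibility_rate(successes: int, attempts: int) -> float:
    """Fraction of re-executions that reproduced the original result."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def two_proportion_z_test(s1: int, n1: int, s2: int, n2: int):
    """Return (difference, z, two-sided p) for rates s1/n1 versus s2/n2.

    Uses the pooled standard error and a normal approximation.
    """
    p1 = reproducibility_rate(s1, n1)
    p2 = reproducibility_rate(s2, n2)
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; z is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value


# Hypothetical counts: with guidelines vs. without guidelines.
diff, z, p = two_proportion_z_test(18, 20, 10, 20)
print(f"difference={diff:.2f}, z={z:.2f}, p={p:.4f}")
```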
Original abstract
Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommended practices (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the tool architecture beyond the model; (4) disclose prompts, their development, and interaction logs; (5) validate LLM outputs with humans; (6) include an open LLM as a baseline; (7) use suitable baselines, benchmarks, and metrics; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address reproducibility challenges in empirical software engineering studies involving LLMs—arising from non-determinism, opaque training data, and rapid model evolution—by presenting a taxonomy of seven study types and eight guidelines developed via collaborative input from 22 researchers. The guidelines distinguish 'must' from 'should' requirements, are mapped to study types via an applicability matrix, and are accompanied by a reporting checklist; they cover declaring LLM usage and role, reporting versions/configurations, documenting tool architecture, disclosing prompts and logs, human validation of outputs, including open-LLM baselines, using suitable baselines/benchmarks/metrics, and articulating limitations.
Significance. If the taxonomy and guidelines see adoption, the work could help standardize practices and improve reproducibility in LLM-based SE research. The collaborative expert synthesis and provision of a living online resource (llm-guidelines.org) are clear strengths that position the contribution as a community-oriented reference.
major comments (1)
- [Abstract] Abstract: the assertion that the taxonomy and guidelines 'address this challenge' of reproducibility threats rests solely on the collaborative synthesis by 22 researchers and logical mapping of practices to study types; no pilot study, retrospective application to existing papers, or measurement of improved replicability is reported, leaving the claim that these specific guidelines are necessary and sufficient untested.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comment. We address the point below and indicate where we will revise the manuscript.
- Referee: [Abstract] Abstract: the assertion that the taxonomy and guidelines 'address this challenge' of reproducibility threats rests solely on the collaborative synthesis by 22 researchers and logical mapping of practices to study types; no pilot study, retrospective application to existing papers, or measurement of improved replicability is reported, leaving the claim that these specific guidelines are necessary and sufficient untested.
Authors: We agree that the manuscript does not contain empirical validation (e.g., a pilot study or retrospective application) demonstrating that the proposed guidelines measurably improve reproducibility. The contribution is a community-synthesized taxonomy and set of guidelines derived from expert consensus rather than a controlled evaluation of their effectiveness. The abstract's phrasing that the work 'address[es] this challenge' can be read as overstating the evidential basis. We will revise the abstract (and the corresponding sentence in the introduction) to state that the taxonomy and guidelines 'aim to mitigate' the identified reproducibility threats, making the scope of the claim explicit. No other changes to the core content are required.
Revision: yes
Circularity Check
No circularity: taxonomy and guidelines synthesized from expert consensus
The paper presents a taxonomy of seven study types and eight guidelines derived from a collaborative effort of 22 researchers. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described content. The central claim rests on expert synthesis and logical mapping of practices to study types, with an applicability matrix and checklist; this is self-contained expert consensus rather than any reduction to self-citation chains, self-definitional constructs, or fitted inputs renamed as predictions. No load-bearing self-citations or ansatzes are invoked. This matches the default expectation of non-circularity for guideline papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the non-determinism, opaque training data, and rapid evolution of LLMs threaten the reproducibility and replicability of empirical studies in software engineering.
Forward citations
Cited by 7 Pith papers
- LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda
  A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.
- Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape
  A survey of 457 SE researchers finds widespread GenAI use concentrated in writing and ideation, with productivity gains but persistent concerns over accuracy, bias, and the need for clearer governance rules.
- Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study
  Analysis of 9,799 human-reviewed agentic PRs shows only 35.7% of rejections reflect clear agent failures, with 31.2% due to workflow constraints and 33.1% lacking clear rationale, plus notable interaction differences ...
- Agentic Business Process Management: A Research Manifesto
  Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organ...
- Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions
  Large-scale study of GitHub AI code review actions finds concise comments with code snippets, manual triggers, and hunk-level tools are more likely to produce code changes.
- Rethinking Artifact Evaluation for Software Engineering in the Age of Generative AI
  Artifact evaluation should become a first-class part of peer review in software engineering because generative AI weakens writing quality as a signal of scientific substance.
- Investigating Notable Metadata Practices in PyPI Libraries: An Empirical Study about Repository and Donation Platform URLs
  PyPI metadata gaps arise mainly from oversight, skepticism, and platform preferences, as shown by surveys of 1,776 responses analyzed with a robust LLaMA-based topic model.