
arxiv: 2508.15503 · v6 · pith:UVBEQKV3new · submitted 2025-08-21 · 💻 cs.SE

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

Pith reviewed 2026-05-25 08:15 UTC · model grok-4.3

classification 💻 cs.SE
keywords large language models · software engineering · empirical studies · reproducibility · guidelines · taxonomy · reporting checklist

The pith

A taxonomy of seven study types and eight guidelines address reproducibility threats in empirical software engineering research that uses large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a collaborative framework to counter threats from non-determinism, opaque training data, and model evolution that undermine reproducibility in LLM-based SE studies. It introduces a taxonomy classifying seven distinct ways LLMs appear in research designs and pairs it with eight guidelines that separate mandatory requirements from recommended practices. Each guideline is mapped to the study types it applies to, supported by an applicability matrix and a reporting checklist. The work maintains these resources online as a living community document. A sympathetic reader would care because the guidelines target concrete reporting failures that currently make many LLM studies hard to replicate or review.

Core claim

The authors present a taxonomy of seven study types that organizes how LLMs are used in SE research together with eight guidelines for designing and reporting such studies. The guidelines require researchers to declare LLM usage and role, report model versions and configurations, document tool architecture, disclose prompts and interaction logs, validate outputs with humans, include an open LLM baseline, use suitable baselines and metrics, and articulate limitations and mitigations. Requirements are distinguished from recommendations and contextualized by study type.
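As a rough illustration of how the eight reporting items could be tracked for a single study, the sketch below uses a plain Python dataclass. The field names and the `missing_items` helper are hypothetical; they mirror the eight guideline topics listed above and are not taken from the paper's own checklist.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class StudyReport:
    """Hypothetical record with one free-text field per guideline topic."""
    llm_usage_and_role: Optional[str] = None
    model_version_and_config: Optional[str] = None
    tool_architecture: Optional[str] = None
    prompts_and_logs: Optional[str] = None
    human_validation: Optional[str] = None
    open_llm_baseline: Optional[str] = None
    baselines_and_metrics: Optional[str] = None
    limitations_and_mitigations: Optional[str] = None


def missing_items(report: StudyReport) -> list[str]:
    """Return the names of fields that are unset or whitespace only."""
    return [
        f.name
        for f in fields(report)
        if not (getattr(report, f.name) or "").strip()
    ]
```

Calling `missing_items(StudyReport(llm_usage_and_role="LLM as annotator"))` would list the seven fields still left to fill in. A real checklist would also need to account for which items are requirements and which are recommendations for a given study type.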

What carries the argument

The taxonomy of seven study types combined with eight guidelines that distinguish must from should requirements and map to study types via an applicability matrix.
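One way to picture an applicability matrix of this kind is as a nested mapping from study type to guideline to obligation level. The sketch below is an assumption about the shape only: the study type names (`study_type_a`, `study_type_b`), the guideline ids (`G1` to `G8`), and every cell value are placeholders, not the paper's actual matrix.

```python
from enum import Enum


class Level(Enum):
    MUST = "must"
    SHOULD = "should"


# Placeholder rows only. The real matrix covers seven study types and
# eight guidelines; its cells are not reproduced here.
MATRIX: dict[str, dict[str, Level]] = {
    "study_type_a": {"G1": Level.MUST, "G2": Level.MUST, "G5": Level.SHOULD},
    "study_type_b": {"G1": Level.MUST, "G4": Level.SHOULD},
}


def applicable(study_type: str, level: Level) -> list[str]:
    """List guideline ids that apply to a study type at the given level."""
    row = MATRIX.get(study_type, {})
    return sorted(g for g, lvl in row.items() if lvl is level)
```

With the placeholder rows, `applicable("study_type_a", Level.MUST)` returns `["G1", "G2"]`, and an unknown study type yields an empty list.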

If this is right

  • Authors following the guidelines will explicitly declare the role of any LLM and report exact model versions plus customizations.
  • Prompts, their iterative development process, and full interaction logs will be disclosed for every study that uses an LLM.
  • Every such study will include validation of LLM outputs by humans and at least one open LLM as a baseline comparison.
  • Limitations arising from non-determinism or model changes will be stated together with explicit mitigation steps.
  • Reviewers will have a checklist that ties each guideline to the relevant study type to assess reporting completeness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy and guideline structure could be adapted to empirical studies outside software engineering that rely on LLMs for data generation or analysis.
  • If the guidelines are widely adopted, future meta-analyses could measure whether adherence correlates with higher replication success rates across published SE papers.
  • The living online resource could serve as a testbed for community-driven updates when new LLM capabilities or failure modes emerge.

Load-bearing premise

The identified threats to reproducibility are the main problems and can be sufficiently mitigated by following the proposed guidelines without separate validation of the guidelines' actual impact on reproducibility rates.

What would settle it

A controlled comparison in which multiple independent teams re-execute the same set of LLM-based SE studies, once using only the new guidelines and once without them, and measure whether reproducibility rates differ measurably.
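The comparison described above does not name a statistical test. As one conventional option, the sketch below applies a two-proportion z-test to the counts of successfully reproduced studies in each condition. The function name and its parameters are assumptions, no data is included, and because the same studies would be re-executed under both conditions, a paired test such as McNemar's might fit the design better than this simplification.

```python
import math


def two_proportion_z(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference of two proportions.

    success_a / n_a: reproduced studies and total studies with guidelines.
    success_b / n_b: reproduced studies and total studies without them.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one study")
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Every study succeeded or every study failed: no observable difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The normal approximation is only reasonable when each condition has a fair number of studies; with small samples an exact test would be the safer choice.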

Original abstract

Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommended practices (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the tool architecture beyond the model; (4) disclose prompts, their development, and interaction logs; (5) validate LLM outputs with humans; (6) include an open LLM as a baseline; (7) use suitable baselines, benchmarks, and metrics; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to address reproducibility challenges in empirical software engineering studies involving LLMs—arising from non-determinism, opaque training data, and rapid model evolution—by presenting a taxonomy of seven study types and eight guidelines developed via collaborative input from 22 researchers. The guidelines distinguish 'must' from 'should' requirements, are mapped to study types via an applicability matrix, and are accompanied by a reporting checklist; they cover declaring LLM usage and role, reporting versions/configurations, documenting tool architecture, disclosing prompts and logs, human validation of outputs, including open-LLM baselines, using suitable baselines/benchmarks/metrics, and articulating limitations.

Significance. If the taxonomy and guidelines see adoption, the work could help standardize practices and improve reproducibility in LLM-based SE research. The collaborative expert synthesis and provision of a living online resource (llm-guidelines.org) are clear strengths that position the contribution as a community-oriented reference.

major comments (1)
  1. [Abstract] Abstract: the assertion that the taxonomy and guidelines 'address this challenge' of reproducibility threats rests solely on the collaborative synthesis by 22 researchers and logical mapping of practices to study types; no pilot study, retrospective application to existing papers, or measurement of improved replicability is reported, leaving the claim that these specific guidelines are necessary and sufficient untested.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive comment. We address the point below and indicate where we will revise the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the taxonomy and guidelines 'address this challenge' of reproducibility threats rests solely on the collaborative synthesis by 22 researchers and logical mapping of practices to study types; no pilot study, retrospective application to existing papers, or measurement of improved replicability is reported, leaving the claim that these specific guidelines are necessary and sufficient untested.

    Authors: We agree that the manuscript does not contain empirical validation (e.g., a pilot study or retrospective application) demonstrating that the proposed guidelines measurably improve reproducibility. The contribution is a community-synthesized taxonomy and set of guidelines derived from expert consensus rather than a controlled evaluation of their effectiveness. The abstract's phrasing that the work 'address[es] this challenge' can be read as overstating the evidential basis. We will revise the abstract (and the corresponding sentence in the introduction) to state that the taxonomy and guidelines 'aim to mitigate' the identified reproducibility threats, making the scope of the claim explicit. No other changes to the core content are required.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: taxonomy and guidelines synthesized from expert consensus

Full rationale

The paper presents a taxonomy of seven study types and eight guidelines derived from a collaborative effort of 22 researchers. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described content. The central claim rests on expert synthesis and logical mapping of practices to study types, with an applicability matrix and checklist; this is self-contained expert consensus rather than any reduction to self-citation chains, self-definitional constructs, or fitted inputs renamed as predictions. No load-bearing self-citations or ansatzes are invoked. This matches the default expectation of non-circularity for guideline papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on expert consensus for the taxonomy and guidelines rather than new empirical evidence or formal derivations.

axioms (1)
  • Domain assumption: Non-determinism, opaque training data, and rapidly evolving models of LLMs threaten the reproducibility and replicability of empirical studies in software engineering.
    This is the challenge stated as the motivation for the work.

pith-pipeline@v0.9.0 · 5845 in / 1307 out tokens · 52207 ms · 2026-05-25T08:15:54.225210+00:00 · methodology


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda

    cs.SE 2026-04 unverdicted novelty 7.0

    A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.

  2. Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

    cs.SE 2026-04 accept novelty 7.0

    A survey of 457 SE researchers finds widespread GenAI use concentrated in writing and ideation, with productivity gains but persistent concerns over accuracy, bias, and the need for clearer governance rules.

  3. Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study

    cs.SE 2026-05 unverdicted novelty 6.0

    Analysis of 9,799 human-reviewed agentic PRs shows only 35.7% of rejections reflect clear agent failures, with 31.2% due to workflow constraints and 33.1% lacking clear rationale, plus notable interaction differences ...

  4. Agentic Business Process Management: A Research Manifesto

    cs.AI 2026-03 unverdicted novelty 6.0

    Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organ...

  5. Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions

    cs.SE 2025-08 unverdicted novelty 6.0

    Large-scale study of GitHub AI code review actions finds concise comments with code snippets, manual triggers, and hunk-level tools are more likely to produce code changes.

  6. Rethinking Artifact Evaluation for Software Engineering in the Age of Generative AI

    cs.SE 2026-01 accept novelty 5.0

    Artifact evaluation should become a first-class part of peer review in software engineering because generative AI weakens writing quality as a signal of scientific substance.

  7. Investigating Notable Metadata Practices in PyPI Libraries: An Empirical Study about Repository and Donation Platform URLs

    cs.SE 2026-01 unverdicted novelty 5.0

    PyPI metadata gaps arise mainly from oversight, skepticism, and platform preferences, as shown by surveys of 1,776 responses analyzed with a robust LLaMA-based topic model.

Reference graph

Works this paper leans on

159 extracted references · 159 canonical work pages · cited by 7 Pith papers · 11 internal anchors
