Guidelines for Empirical Studies in Software Engineering involving Large Language Models

Sebastian Baltes, Florian Angermeir, Chetan Arora, Marvin Muñoz Barón, Chunyang Chen, Lukas Böhme, Fabio Calefato, Neil Ernst

Davide Falessi, Brian Fitzgerald, Davide Fucci, Junda He, Christoph Treude, Marcos Kalinowski, Stefano Lambiase, Daniel Russo, Mircea Lungu, Cristina Martinez Montes, Lutz Prechelt, Paul Ralph, Rijnard van Tonder, Stefan Wagner

classification 💻 cs.SE

keywords guidelines, types, models, studies, empirical, engineering, language, large

Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommended practices (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the tool architecture beyond the model; (4) disclose prompts, their development, and interaction logs; (5) validate LLM outputs with humans; (6) include an open LLM as a baseline; (7) use suitable baselines, benchmarks, and metrics; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda
cs.SE 2026-04 unverdicted novelty 7.0

A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.

Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape
cs.SE 2026-04 accept novelty 7.0

A survey of 457 SE researchers finds widespread GenAI use concentrated in writing and ideation, with productivity gains but persistent concerns over accuracy, bias, and the need for clearer governance rules.

Agentic Business Process Management: A Research Manifesto
cs.AI 2026-03 unverdicted novelty 6.0

Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organ...