pith. machine review for the scientific record.

arxiv: 2508.15503 · v5 · submitted 2025-08-21 · 💻 cs.SE

Recognition: unknown

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

classification 💻 cs.SE
keywords guidelines, types, models, studies, empirical, engineering, language, large

Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommended practices (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the tool architecture beyond the model; (4) disclose prompts, their development, and interaction logs; (5) validate LLM outputs with humans; (6) include an open LLM as a baseline; (7) use suitable baselines, benchmarks, and metrics; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).

This paper has not been read by Pith yet.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda

    cs.SE 2026-04 unverdicted novelty 7.0

    A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.

  2. Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

    cs.SE 2026-04 accept novelty 7.0

    A survey of 457 SE researchers finds widespread GenAI use concentrated in writing and ideation, with productivity gains but persistent concerns over accuracy, bias, and the need for clearer governance rules.

  3. Agentic Business Process Management: A Research Manifesto

    cs.AI 2026-03 unverdicted novelty 6.0

    Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organ...