Rethinking Artifact Evaluation for Software Engineering in the Age of Generative AI
Pith reviewed 2026-05-16 11:02 UTC · model grok-4.3
The pith
Artifact evaluation should be treated as a first-class component of peer review in software engineering research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Peer review in software engineering operates under tight time constraints, while generative AI reduces the human effort needed for polished narratives. Reviewer attention therefore drifts toward aspects that are now easier to improve, rather than toward verifying that methods are correctly implemented, analyses are sound, and claims are supported by evidence. In software engineering, this scientific substance is frequently embodied in artifacts, including code, data, evidence and analysis samples, and experimental infrastructure. The paper therefore argues that artifact evaluation should be treated as a first-class component of peer review decisions.
What carries the argument
Artifact evaluation as the mechanism that verifies implementation correctness, analytical soundness, and evidential support for research claims.
If this is right
- Peer review decisions would weigh artifact assessments more heavily than narrative polish.
- Reviewers would allocate greater time and attention to examining code, data, and experimental setups.
- Authors would need to prepare more complete, accessible, and verifiable artifacts for submission.
- Conferences and journals would adjust guidelines and review workflows to support artifact checking.
Where Pith is reading between the lines
- Venues could introduce dedicated artifact review tracks or incentives to make the shift practical.
- Tooling for automated partial checks on artifacts might emerge to reduce the remaining human burden.
- The same logic could prompt reevaluation of review practices in other empirical fields facing similar AI pressures.
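The second point above can be made concrete with a hypothetical sketch of what an automated partial check might look like. This is not an existing tool, and the expected file names are illustrative assumptions rather than any community standard; it only screens for cheap, mechanical signals of completeness before human review.

```python
from pathlib import Path

# Hypothetical minimal "artifact smoke check": verifies that a research
# artifact directory contains basic ingredients a reviewer would look for
# before any deeper, expert evaluation. File names are illustrative.
EXPECTED = {
    "readme": ["README.md", "README.txt", "README"],
    "license": ["LICENSE", "LICENSE.md", "LICENSE.txt"],
    "dependencies": ["requirements.txt", "environment.yml", "pyproject.toml"],
}

def smoke_check(artifact_dir: str) -> dict:
    """Report which expected ingredients are present in the artifact."""
    root = Path(artifact_dir)
    report = {key: any((root / name).exists() for name in names)
              for key, names in EXPECTED.items()}
    # A tests/ directory is a weak but cheap signal of verifiability.
    report["tests"] = (root / "tests").is_dir()
    return report
```

A check like this cannot establish soundness, only flag obviously incomplete submissions, so it would reduce rather than replace the human burden the paper emphasizes.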
Load-bearing premise
Generative AI has substantially reduced the effort required to produce high-quality narratives while artifact evaluation remains effort-intensive and dependent on human expertise.
What would settle it
A controlled experiment in which reviewers assess the same set of AI-assisted papers first from narratives alone and then with full artifact access, measuring whether their rigor judgments change significantly.
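One way the primary analysis of such an experiment could be run is a paired sign-flip permutation test on each reviewer's change in rigor score between the narrative-only and full-artifact conditions. This is a minimal sketch under that assumed design; the example scores are made up, not data from any actual study.

```python
import random

def sign_flip_test(diffs, n_perm=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on score differences.

    diffs[i] = reviewer i's rigor score with full artifact access minus
    the score the same reviewer gave from the narrative alone.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each reviewer's change is equally likely
        # to go in either direction, so signs can be flipped freely.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / n_perm  # p-value: how often chance matches the shift

# Illustrative (made-up) per-reviewer score changes on a 1-5 rigor scale.
example_diffs = [-1, -2, 0, -1, -1, 0, -2, -1, -1, 0]
p = sign_flip_test(example_diffs)
```

A significant p-value here would indicate that artifact access systematically shifts rigor judgments, which is exactly the effect the paper's load-bearing premise predicts.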
Original abstract
Peer review in software engineering research operates under tight time constraints, while generative AI has substantially reduced the human effort required to produce polished research narratives. Reviewer attention is often spent on aspects of submissions such as writing quality or literature positioning that have become relatively less effort-intensive to address, rather than on evaluating the scientific substance of a paper. At the same time, assessing whether methods are implemented correctly, analyses are sound, and claims are supported by evidence remains effort-intensive and dependent on human expertise. In software engineering research, this substance is frequently embodied in artifacts, including code, data, evidence and analysis samples, and experimental infrastructure. In this position paper, we argue that artifact evaluation should be treated as a first-class component of peer review. We frame peer review as an attention allocation problem, examine how generative AI weakens narrative quality as a signal of rigor, and argue that artifact evaluation should play a more prominent role in peer review decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that generative AI has substantially reduced the human effort required to produce polished research narratives in software engineering, weakening narrative quality as a signal of rigor, while evaluation of artifacts (code, data, analyses, experimental infrastructure) remains effort-intensive and dependent on human expertise. It frames peer review as an attention-allocation problem and recommends treating artifact evaluation as a first-class component of peer review decisions.
Significance. If the argument holds, elevating artifact evaluation could meaningfully redirect limited reviewer attention toward verifiable scientific substance rather than AI-polished writing and positioning, potentially improving reproducibility and rigor in SE research. The paper offers a timely normative framing grounded in observed AI capabilities and standard SE practices, and its recommendation rests on no internal contradictions or unstated premises.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly outline concrete mechanisms (e.g., mandatory artifact review checklists or dedicated review phases) for implementing the first-class status of artifact evaluation.
- [Discussion] A brief discussion of potential reviewer workload trade-offs or training requirements for artifact-focused review would strengthen the practical implications section.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our position paper and their recommendation to accept. Their summary accurately reflects our core argument that generative AI has reduced the effort needed for polished narratives, thereby weakening writing quality as a reliable signal of scientific rigor, while artifact evaluation remains effort-intensive and central to verifying claims in software engineering research. We appreciate the recognition of the paper's timeliness and lack of internal contradictions.
Circularity Check
No significant circularity identified
full rationale
The manuscript is a position paper advancing a normative recommendation to treat artifact evaluation as first-class in peer review. It frames the issue as an attention-allocation problem arising from generative AI's differential impact on narrative polish versus the enduring human-expertise demands of verifying implementations and evidence. No equations, derivations, fitted parameters, or self-citation chains exist; the argument rests on direct observation of current AI capabilities and SE practices without any step that reduces by construction to its own inputs or prior self-referential claims.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Generative AI has substantially reduced the human effort required to produce polished research narratives.
- domain assumption: Assessing methods, analyses, and evidence support remains effort-intensive and dependent on human expertise.