pith. machine review for the scientific record.

arxiv: 2604.16306 · v1 · submitted 2026-01-26 · 💻 cs.SE

Recognition: no theorem link

Rethinking Artifact Evaluation for Software Engineering in the Age of Generative AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:02 UTC · model grok-4.3

classification: 💻 cs.SE
keywords: artifact evaluation · peer review · generative AI · software engineering · research rigor · reproducibility · attention allocation

The pith

Artifact evaluation should be treated as a first-class component of peer review in software engineering research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative AI now lets researchers produce polished narratives with far less effort, so reviewer time spent on writing quality or literature positioning has become a weaker signal of actual rigor. The paper frames peer review as an attention allocation problem and argues that the remaining effort-intensive part—checking whether methods are implemented correctly, analyses are sound, and claims are backed by evidence—lives in artifacts such as code, data, and experimental infrastructure. Elevating artifact evaluation would redirect attention to the scientific substance that still requires human expertise. A sympathetic reader would care because this reallocation could make published software engineering research more reliable and reproducible.

Core claim

Peer review in software engineering operates under tight time constraints while generative AI reduces the human effort needed for polished narratives. Reviewer attention therefore drifts toward aspects that are now easier to improve, rather than toward verifying that methods are correctly implemented, analyses are sound, and claims are supported by evidence. In software engineering this substance is frequently embodied in artifacts including code, data, evidence and analysis samples, and experimental infrastructure. The paper therefore argues that artifact evaluation should be treated as a first-class component of peer review decisions.

What carries the argument

Artifact evaluation as the mechanism that verifies implementation correctness, analytical soundness, and evidential support for research claims.

If this is right

  • Peer review decisions would weigh artifact assessments more heavily than narrative polish.
  • Reviewers would allocate greater time and attention to examining code, data, and experimental setups.
  • Authors would need to prepare more complete, accessible, and verifiable artifacts for submission.
  • Conferences and journals would adjust guidelines and review workflows to support artifact checking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Venues could introduce dedicated artifact review tracks or incentives to make the shift practical.
  • Tooling for automated partial checks on artifacts might emerge to reduce the remaining human burden (a minimal sketch of such a check follows this list).
  • The same logic could prompt reevaluation of review practices in other empirical fields facing similar AI pressures.
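
One of the extensions above imagines tooling for automated partial checks on artifacts. The sketch below shows what such a pre-check could look like; it is purely illustrative and not from the paper, and the expected file names, the tests/ directory convention, and the use of pytest are all assumptions.

    # Hypothetical artifact pre-check: cheap structural signals only.
    # File names, layout, and the pytest call are illustrative assumptions;
    # substantive review of methods and analyses still needs a human.
    import subprocess
    import sys
    from pathlib import Path

    REQUIRED_FILES = ["README.md", "LICENSE", "requirements.txt"]

    def check_artifact(root: str) -> dict:
        """Return a name -> pass/fail map of basic completeness checks."""
        root_path = Path(root)
        report = {name: (root_path / name).exists() for name in REQUIRED_FILES}
        report["tests_directory"] = (root_path / "tests").is_dir()

        # Only attempt to run the test suite if one is present.
        if report["tests_directory"]:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "--quiet", str(root_path / "tests")],
                capture_output=True,
                text=True,
            )
            report["tests_pass"] = result.returncode == 0
        return report

    if __name__ == "__main__":
        for item, ok in check_artifact(sys.argv[1]).items():
            print(f"{'PASS' if ok else 'FAIL'}  {item}")

Checks like these would only offload the mechanical part of artifact review; the paper's point that verifying correctness and evidential support depends on human expertise still stands.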

Load-bearing premise

Generative AI has substantially reduced the effort required to produce high-quality narratives while artifact evaluation remains effort-intensive and dependent on human expertise.

What would settle it

A controlled experiment in which reviewers assess the same set of AI-assisted papers first from narratives alone and then with full artifact access, measuring whether their rigor judgments change significantly.
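
Because each reviewer would score the same paper under both conditions, the natural analysis is a paired comparison of the two sets of rigor scores. The sketch below is illustrative only: the 1–7 score scale, the made-up numbers, and the choice of a Wilcoxon signed-rank test (a paired test that avoids normality assumptions) are assumptions, not details given in the paper.

    # Illustrative paired analysis for the proposed experiment; data are made up.
    from scipy.stats import wilcoxon

    # Hypothetical rigor scores (1-7) from eight reviewer-paper pairs.
    narrative_only = [5, 6, 5, 4, 6, 5, 7, 5]   # narrative and positioning only
    with_artifacts = [3, 5, 4, 2, 5, 3, 6, 4]   # same papers, full artifact access

    # Paired, non-parametric test of whether artifact access shifts judgments.
    statistic, p_value = wilcoxon(narrative_only, with_artifacts)
    print(f"Wilcoxon W = {statistic:.1f}, p = {p_value:.3f}")

A clear shift in these paired scores would support the premise that narrative polish and artifact-level substance are now dissociated signals; no shift would weaken it.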

original abstract

Peer review in software engineering research operates under tight time constraints, while generative AI has substantially reduced the human effort required to produce polished research narratives. Reviewer attention is often spent on aspects of submissions such as writing quality or literature positioning that have become relatively less effort-intensive to address, rather than on evaluating the scientific substance of a paper. At the same time, assessing whether methods are implemented correctly, analyses are sound, and claims are supported by evidence remains effort-intensive and dependent on human expertise. In software engineering research, this substance is frequently embodied in artifacts, including code, data, evidence and analysis samples, and experimental infrastructure. In this position paper, we argue that artifact evaluation should be treated as a first-class component of peer review. We frame peer review as an attention allocation problem, examine how generative AI weakens narrative quality as a signal of rigor, and argue that artifact evaluation should play a more prominent role in peer review decisions.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. This position paper argues that generative AI has substantially reduced the human effort required to produce polished research narratives in software engineering, weakening narrative quality as a signal of rigor, while evaluation of artifacts (code, data, analyses, experimental infrastructure) remains effort-intensive and dependent on human expertise. It frames peer review as an attention-allocation problem and recommends treating artifact evaluation as a first-class component of peer review decisions.

Significance. If the argument holds, elevating artifact evaluation could meaningfully redirect limited reviewer attention toward verifiable scientific substance rather than AI-polished writing and positioning, potentially improving reproducibility and rigor in SE research. The paper offers a timely normative framing grounded in observed AI capabilities and standard SE practices, with no internal contradictions or unstated premises required for the recommendation.

minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly outline concrete mechanisms (e.g., mandatory artifact review checklists or dedicated review phases) for implementing the first-class status of artifact evaluation.
  2. [Discussion] A brief discussion of potential reviewer workload trade-offs or training requirements for artifact-focused review would strengthen the practical implications section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our position paper and their recommendation to accept. Their summary accurately reflects our core argument that generative AI has reduced the effort needed for polished narratives, thereby weakening writing quality as a reliable signal of scientific rigor, while artifact evaluation remains effort-intensive and central to verifying claims in software engineering research. We appreciate the recognition of the paper's timeliness and lack of internal contradictions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript is a position paper advancing a normative recommendation to treat artifact evaluation as first-class in peer review. It frames the issue as an attention-allocation problem arising from generative AI's differential impact on narrative polish versus the enduring human-expertise demands of verifying implementations and evidence. No equations, derivations, fitted parameters, or self-citation chains exist; the argument rests on direct observation of current AI capabilities and SE practices without any step that reduces by construction to its own inputs or prior self-referential claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The argument assumes generative AI substantially lowers the cost of high-quality narrative without improving underlying substance, and that artifacts remain the primary embodiment of rigor in SE research.

axioms (2)
  • domain assumption: Generative AI has substantially reduced the human effort required to produce polished research narratives.
    Stated directly in the abstract as the premise driving the attention-allocation problem.
  • domain assumption: Assessing methods, analyses, and evidence support remains effort-intensive and dependent on human expertise.
    Presented as the contrasting element that makes artifact evaluation valuable.

pith-pipeline@v0.9.0 · 5460 in / 1193 out tokens · 23883 ms · 2026-05-16T11:02:29.772388+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Sebastian Baltes, Florian Angermeir, Chetan Arora, Marvin Muñoz Barón, Chunyang Chen, Lukas Böhme, Fabio Calefato, Neil Ernst, Davide Falessi, Brian Fitzgerald, Davide Fucci, Marcos Kalinowski, Stefano Lambiase, Daniel Russo, Mircea Lungu, Lutz Prechelt, Paul Ralph, Rijnard van Tonder, Christoph Treude, and Stefan Wagner. 2025. Guidelines for Empirica...

  2. [2]

    Ben Hermann, Stefan Winter, and Janet Siegmund. 2020. Community expectations for research artifacts and evaluation processes. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 469–480

  3. [3]

    Rashina Hoda. 2024. Qualitative Research with Socio-Technical Grounded Theory. Springer. doi:10.1007/978-3-031-60533-8

  4. [4]

    Shriram Krishnamurthi. 2013. Artifact evaluation for software conferences. ACM SIGSOFT Software Engineering Notes 38, 3 (2013), 7–10

  5. [5]

    Mugeng Liu, Xiaolong Huang, Wei He, Yibing Xie, Jie M Zhang, Xiang Jing, Zhenpeng Chen, and Yun Ma. 2024. Research artifacts in software engineering publications: Status and trends. Journal of Systems and Software 213 (2024), 112032

  6. [6]

    Martin Monperrus, Benoit Baudry, and Clément Vidal. 2025. Project Rachel: Can an AI Become a Scholarly Author? arXiv preprint arXiv:2511.14819 (2025)

  7. [7]

    Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, J...

  8. [8]

    Per Runeson, Martin Höst, Austen Rainer, and Björn Regnell. 2012. Case study research in software engineering: Guidelines and examples. John Wiley & Sons

  9. [9]

    Margaret-Anne Storey, Rashina Hoda, Alessandra Maciel Paz Milani, and Maria Teresa Baldassarre. 2025. Guiding Principles for Using Mixed Methods Research in Software Engineering. Empirical Software Engineering (2025)

  10. [10]

    Christopher S Timperley, Lauren Herckis, Claire Le Goues, and Michael Hilton. 2021. Understanding and improving artifact sharing in software engineering research. Empirical Software Engineering (2021), 1–41. Issue 1

  11. [11]

    Stefan Winter, Christopher S Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer. 2022. A retrospective study of one decade of artifact evaluations. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 145–156

  12. [12]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer