pith. machine review for the scientific record.

arxiv: 2604.16306 · v1 · submitted 2026-01-26 · 💻 cs.SE

Recognition: no theorem link

Rethinking Artifact Evaluation for Software Engineering in the Age of Generative AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:02 UTC · model grok-4.3

classification: 💻 cs.SE
keywords: artifact evaluation · peer review · generative AI · software engineering · research rigor · reproducibility · attention allocation

The pith

Artifact evaluation should be treated as a first-class component of peer review in software engineering research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative AI now lets researchers produce polished narratives with far less effort, so reviewer time spent on writing quality or literature positioning has become a weaker signal of actual rigor. The paper frames peer review as an attention allocation problem and argues that the remaining effort-intensive part—checking whether methods are implemented correctly, analyses are sound, and claims are backed by evidence—lives in artifacts such as code, data, and experimental infrastructure. Elevating artifact evaluation would redirect attention to the scientific substance that still requires human expertise. A sympathetic reader would care because this reallocation could make published software engineering research more reliable and reproducible.

Core claim

Peer review in software engineering operates under tight time constraints while generative AI reduces the human effort needed for polished narratives. Reviewer attention therefore drifts toward aspects that are now easier to improve, rather than toward verifying that methods are correctly implemented, analyses are sound, and claims are supported by evidence. In software engineering this substance is frequently embodied in artifacts including code, data, evidence and analysis samples, and experimental infrastructure. The paper therefore argues that artifact evaluation should be treated as a first-class component of peer review decisions.

What carries the argument

Artifact evaluation as the mechanism that verifies implementation correctness, analytical soundness, and evidential support for research claims.

If this is right

  • Peer review decisions would weigh artifact assessments more heavily than narrative polish.
  • Reviewers would allocate greater time and attention to examining code, data, and experimental setups.
  • Authors would need to prepare more complete, accessible, and verifiable artifacts for submission.
  • Conferences and journals would adjust guidelines and review workflows to support artifact checking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Venues could introduce dedicated artifact review tracks or incentives to make the shift practical.
  • Tooling for automated partial checks on artifacts might emerge to reduce the remaining human burden (a minimal sketch of such a check follows this list).
  • The same logic could prompt reevaluation of review practices in other empirical fields facing similar AI pressures.
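
One of the extensions above imagines tooling for automated partial checks on artifacts. The sketch below shows what such a pre-check could look like; it is purely illustrative and not from the paper, and the expected file names, the tests/ directory convention, and the use of pytest are all assumptions.

    # Hypothetical artifact pre-check: cheap structural signals only.
    # File names, layout, and the pytest call are illustrative assumptions;
    # substantive review of methods and analyses still needs a human.
    import subprocess
    import sys
    from pathlib import Path

    REQUIRED_FILES = ["README.md", "LICENSE", "requirements.txt"]

    def check_artifact(root: str) -> dict:
        """Return a name -> pass/fail map of basic completeness checks."""
        root_path = Path(root)
        report = {name: (root_path / name).exists() for name in REQUIRED_FILES}
        report["tests_directory"] = (root_path / "tests").is_dir()

        # Only attempt to run the test suite if one is present.
        if report["tests_directory"]:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "--quiet", str(root_path / "tests")],
                capture_output=True,
                text=True,
            )
            report["tests_pass"] = result.returncode == 0
        return report

    if __name__ == "__main__":
        for item, ok in check_artifact(sys.argv[1]).items():
            print(f"{'PASS' if ok else 'FAIL'}  {item}")

Checks like these would only offload the mechanical part of artifact review; the paper's point that verifying correctness and evidential support depends on human expertise still stands.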

Load-bearing premise

Generative AI has substantially reduced the effort required to produce high-quality narratives while artifact evaluation remains effort-intensive and dependent on human expertise.

What would settle it

A controlled experiment in which reviewers assess the same set of AI-assisted papers first from narratives alone and then with full artifact access, measuring whether their rigor judgments change significantly.
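
Because each reviewer would score the same paper under both conditions, the natural analysis is a paired comparison of the two sets of rigor scores. The sketch below is illustrative only: the 1–7 score scale, the made-up numbers, and the choice of a Wilcoxon signed-rank test (a paired test that avoids normality assumptions) are assumptions, not details given in the paper.

    # Illustrative paired analysis for the proposed experiment; data are made up.
    from scipy.stats import wilcoxon

    # Hypothetical rigor scores (1-7) from eight reviewer-paper pairs.
    narrative_only = [5, 6, 5, 4, 6, 5, 7, 5]   # narrative and positioning only
    with_artifacts = [3, 5, 4, 2, 5, 3, 6, 4]   # same papers, full artifact access

    # Paired, non-parametric test of whether artifact access shifts judgments.
    statistic, p_value = wilcoxon(narrative_only, with_artifacts)
    print(f"Wilcoxon W = {statistic:.1f}, p = {p_value:.3f}")

A clear shift in these paired scores would support the premise that narrative polish and artifact-level substance are now dissociated signals; no shift would weaken it.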

original abstract

Peer review in software engineering research operates under tight time constraints, while generative AI has substantially reduced the human effort required to produce polished research narratives. Reviewer attention is often spent on aspects of submissions such as writing quality or literature positioning that have become relatively less effort-intensive to address, rather than on evaluating the scientific substance of a paper. At the same time, assessing whether methods are implemented correctly, analyses are sound, and claims are supported by evidence remains effort-intensive and dependent on human expertise. In software engineering research, this substance is frequently embodied in artifacts, including code, data, evidence and analysis samples, and experimental infrastructure. In this position paper, we argue that artifact evaluation should be treated as a first-class component of peer review. We frame peer review as an attention allocation problem, examine how generative AI weakens narrative quality as a signal of rigor, and argue that artifact evaluation should play a more prominent role in peer review decisions.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. This position paper argues that generative AI has substantially reduced the human effort required to produce polished research narratives in software engineering, weakening narrative quality as a signal of rigor, while evaluation of artifacts (code, data, analyses, experimental infrastructure) remains effort-intensive and dependent on human expertise. It frames peer review as an attention-allocation problem and recommends treating artifact evaluation as a first-class component of peer review decisions.

Significance. If the argument holds, elevating artifact evaluation could meaningfully redirect limited reviewer attention toward verifiable scientific substance rather than AI-polished writing and positioning, potentially improving reproducibility and rigor in SE research. The paper offers a timely normative framing grounded in observed AI capabilities and standard SE practices, with no internal contradictions or unstated premises required for the recommendation.

minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly outline concrete mechanisms (e.g., mandatory artifact review checklists or dedicated review phases) for implementing the first-class status of artifact evaluation.
  2. [Discussion] A brief discussion of potential reviewer workload trade-offs or training requirements for artifact-focused review would strengthen the practical implications section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our position paper and their recommendation to accept. Their summary accurately reflects our core argument that generative AI has reduced the effort needed for polished narratives, thereby weakening writing quality as a reliable signal of scientific rigor, while artifact evaluation remains effort-intensive and central to verifying claims in software engineering research. We appreciate the recognition of the paper's timeliness and lack of internal contradictions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript is a position paper advancing a normative recommendation to treat artifact evaluation as first-class in peer review. It frames the issue as an attention-allocation problem arising from generative AI's differential impact on narrative polish versus the enduring human-expertise demands of verifying implementations and evidence. No equations, derivations, fitted parameters, or self-citation chains exist; the argument rests on direct observation of current AI capabilities and SE practices without any step that reduces by construction to its own inputs or prior self-referential claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The argument assumes generative AI substantially lowers the cost of high-quality narrative without improving underlying substance, and that artifacts remain the primary embodiment of rigor in SE research.

axioms (2)
  • domain assumption: Generative AI has substantially reduced the human effort required to produce polished research narratives.
    Stated directly in the abstract as the premise driving the attention-allocation problem.
  • domain assumption: Assessing methods, analyses, and evidence support remains effort-intensive and dependent on human expertise.
    Presented as the contrasting element that makes artifact evaluation valuable.

pith-pipeline@v0.9.0 · 5460 in / 1193 out tokens · 23883 ms · 2026-05-16T11:02:29.772388+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Sebastian Baltes, Florian Angermeir, Chetan Arora, Marvin Muñoz Barón, Chunyang Chen, Lukas Böhme, Fabio Calefato, Neil Ernst, Davide Falessi, Brian Fitzgerald, Davide Fucci, Marcos Kalinowski, Stefano Lambiase, Daniel Russo, Mircea Lungu, Lutz Prechelt, Paul Ralph, Rijnard van Tonder, Christoph Treude, and Stefan Wagner. 2025. Guidelines for Empirica...

  2. [2]

    Ben Hermann, Stefan Winter, and Janet Siegmund. 2020. Community expectations for research artifacts and evaluation processes. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 469–480

  3. [3]

    Rashina Hoda. 2024. Qualitative Research with Socio-Technical Grounded Theory. Springer. doi:10.1007/978-3-031-60533-8

  4. [4]

    Shriram Krishnamurthi. 2013. Artifact evaluation for software conferences. ACM SIGSOFT Software Engineering Notes 38, 3 (2013), 7–10

  5. [5]

    Mugeng Liu, Xiaolong Huang, Wei He, Yibing Xie, Jie M Zhang, Xiang Jing, Zhenpeng Chen, and Yun Ma. 2024. Research artifacts in software engineering publications: Status and trends. Journal of Systems and Software 213 (2024), 112032

  6. [6]

    Martin Monperrus, Benoit Baudry, and Clément Vidal. 2025. Project Rachel: Can an AI Become a Scholarly Author? arXiv preprint arXiv:2511.14819 (2025)

  7. [7]

    Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, J...

  8. [8]

    Per Runeson, Martin Höst, Austen Rainer, and Björn Regnell. 2012. Case study research in software engineering: Guidelines and examples. John Wiley & Sons

  9. [9]

    Margaret-Anne Storey, Rashina Hoda, Alessandra Maciel Paz Milani, and Maria Teresa Baldassarre. 2025. Guiding Principles for Using Mixed Methods Research in Software Engineering. Empirical Software Engineering (2025)

  10. [10]

    Christopher S Timperley, Lauren Herckis, Claire Le Goues, and Michael Hilton. 2021. Understanding and improving artifact sharing in software engineering research. Empirical Software Engineering (2021), 1–41. Issue 1

  11. [11]

    Stefan Winter, Christopher S Timperley, Ben Hermann, Jürgen Cito, Jonathan Bell, Michael Hilton, and Dirk Beyer. 2022. A retrospective study of one decade of artifact evaluations. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 145–156

  12. [12]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer