ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

Anastasios Kouvelas; Andres L. Marin; Fan Wu; Georgios Fontaras; Kevin Riehl; Michail A. Makridis; Nikofors Zacharof; Patrick Langer; Robert Jakob

arxiv: 2605.02651 · v2 · pith:SNWUXK5Xnew · submitted 2026-05-04 · 💻 cs.DL · cs.LG

Pith reviewed 2026-05-19 17:09 UTC · model grok-4.3

keywords reproducibility assessment · workflow graphs · agentic AI · scientific peer review · computational reproducibility · LLM agents · ReScience C benchmark

The pith

ARA extracts directed workflow graphs from papers to evaluate reproducibility at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Agentic Reproducibility Assessment as a formal method for turning reproducibility checks into a structured reasoning task over full scientific documents. It first extracts a directed graph that connects sources, methods, experiments, and outputs, then scores how completely that graph can be rebuilt from the text alone. Tests on 213 human-validated ReScience C articles across domains show that the approach produces consistent results regardless of the underlying language model or temperature setting. The system reaches roughly 61 percent accuracy overall and reports the highest accuracy to date on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%). If the graphs truly track real experimental dependencies, the method offers a concrete way to supplement human reviewers when paper volume outstrips manual capacity.

Core claim

ARA formalizes reproducibility assessment as a structured reasoning task over scientific documents by extracting a directed workflow graph linking sources, methods, experiments, and outputs, then evaluating its reconstructability using structural and content-based scores for reproducibility assessments.

What carries the argument

The directed workflow graph that links sources, methods, experiments, and outputs, scored for reconstructability through structural and content-based metrics.
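To make the object under discussion concrete, here is a minimal sketch of a directed workflow graph with the four node types named in the paper's Figure 1 (sources, methods, experiments, sinks). The class name, the method names, and the reachability check are illustrative assumptions, not the authors' implementation; the check is only one simple structural signal (is every output connected to some source?) and is not the paper's scoring rule.

```python
from collections import deque
from dataclasses import dataclass, field

# Node types as listed in the Figure 1 caption; everything else here is hypothetical.
NODE_TYPES = ("source", "method", "experiment", "sink")


@dataclass
class WorkflowGraph:
    """Directed workflow graph: node ids map to one of the four node types."""
    nodes: dict[str, str] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_node(self, node_id: str, node_type: str) -> None:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((src, dst))

    def reachable_from_sources(self) -> set[str]:
        """Breadth-first search that starts at every source node."""
        successors: dict[str, list[str]] = {n: [] for n in self.nodes}
        for src, dst in self.edges:
            successors[src].append(dst)
        seen = {n for n, t in self.nodes.items() if t == "source"}
        queue = deque(seen)
        while queue:
            for nxt in successors[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def unsupported_sinks(self) -> list[str]:
        """Sinks with no path from any source (an assumed structural check)."""
        reachable = self.reachable_from_sources()
        return sorted(
            n for n, t in self.nodes.items() if t == "sink" and n not in reachable
        )


# Hypothetical paper: one output (table_2) is never linked to a data source.
graph = WorkflowGraph()
graph.add_node("dataset", "source")
graph.add_node("preprocessing", "method")
graph.add_node("training_run", "experiment")
graph.add_node("figure_3", "sink")
graph.add_node("table_2", "sink")
graph.add_edge("dataset", "preprocessing")
graph.add_edge("preprocessing", "training_run")
graph.add_edge("training_run", "figure_3")
print(graph.unsupported_sinks())
```

A sink that cannot be traced back to any source is the kind of gap a reader would hit when trying to rebuild the workflow from the text alone.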

If this is right

Reproducibility scoring becomes consistent across different language models and temperature settings.
The method records the highest reported accuracy on the ReproBench and GoldStandardDB benchmarks.
Peer review gains a scalable complement that can process hundreds of papers with uniform criteria.
Next-generation review systems can incorporate automated graph reconstruction as a first-pass filter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Review platforms could route low-scoring papers to deeper human inspection while clearing high-scoring ones faster.
The same graph-extraction logic might extend to non-computational fields if the scoring rules are adjusted for qualitative evidence.
Hybrid human-agent pipelines become feasible where agents handle routine dependency mapping and humans resolve edge cases.
Open release of the extracted graphs alongside papers would let later readers verify or extend the reproducibility claims directly.

Load-bearing premise

The workflow graphs the system extracts accurately reflect the paper's actual experimental dependencies, data flows, and result-generating steps.

What would settle it

Direct head-to-head comparison in which independent human experts manually reconstruct the same set of workflow graphs from the papers and measure whether the automated scores align with those human reconstructions.

Figures

Figures reproduced from arXiv: 2605.02651 by Anastasios Kouvelas, Andres L. Marin, Fan Wu, Georgios Fontaras, Kevin Riehl, Michail A. Makridis, Nikofors Zacharof, Patrick Langer, Robert Jakob.

**Figure 1.** Agentic Reproducibility Assessment Pipeline (ARA). First, a given scientific paper (resp. document) D is transformed into a directed workflow graph G, comprising four types of nodes (sources, methods, experiments, sinks). Second, the workflow graph’s reconstructability is projected on micro-level assessments of reproducibility (node-by-node) r(·). Third, the micro-level assessments are aggregated to reprod…

**Figure 2.** Human-Agent Disagreement on Reproducibility Assessment (ReScience C).

**Figure 3.** Workflow Graph Generated From A Scientific Paper.

**Figure 4.** Score and Rank Robustness under Dirichlet Perturbation. Left: per-profile |∆R| vs default (median, p95 band) as a function of concentration α. Right: per-paper Spearman ρ vs default (min, mean band) across 200 draws per α. Both axes are log-scaled in α.
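The robustness analysis in Figure 4 can be illustrated with a small simulation. The sketch below assumes that the perturbed quantity is the weight vector over (sources, process, sinks), that perturbed weights are drawn from a Dirichlet distribution with parameters α times the default weights, and that a paper's score is a plain weighted sum of three component scores. None of these details are confirmed by the caption; the component scores are invented for illustration. Only the default weights (0.30, 0.40, 0.30) and the 200 draws per α come from the page.

```python
import random
from statistics import mean

DEFAULT_WEIGHTS = (0.30, 0.40, 0.30)  # (sources, process, sinks), as quoted on this page


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def weighted_score(components, weights):
    """Assumed scoring rule: weighted sum of three component scores."""
    return sum(c * w for c, w in zip(components, weights))


def ranks(values):
    """Rank positions (1 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman rho via the rank-difference formula (exact only without ties)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def perturbation_study(papers, concentration, draws=200, seed=0):
    """Return (mean Spearman rho, mean absolute score shift) versus the default weights."""
    rng = random.Random(seed)
    baseline = [weighted_score(p, DEFAULT_WEIGHTS) for p in papers]
    alphas = [concentration * w for w in DEFAULT_WEIGHTS]
    rhos, shifts = [], []
    for _ in range(draws):
        weights = sample_dirichlet(alphas, rng)
        scores = [weighted_score(p, weights) for p in papers]
        rhos.append(spearman(baseline, scores))
        shifts.append(mean(abs(s - b) for s, b in zip(scores, baseline)))
    return mean(rhos), mean(shifts)


# Hypothetical per-paper component scores (sources, process, sinks).
PAPERS = [
    (0.9, 0.8, 0.7),
    (0.4, 0.6, 0.9),
    (0.2, 0.3, 0.1),
    (0.7, 0.2, 0.5),
    (0.5, 0.9, 0.3),
]

for alpha in (1.0, 10.0, 100.0, 1000.0):
    rho, shift = perturbation_study(PAPERS, concentration=alpha)
    print(f"alpha={alpha:7.1f}  mean rho={rho:.3f}  mean |dR|={shift:.4f}")
```

Under this parameterisation a larger α concentrates the sampled weights around the defaults, so score shifts shrink and rankings stay closer to the default ranking.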

Original abstract

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles - the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date - demonstrate ARA's generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next-generation peer review. Code and Data available: https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARA builds agentic workflow graphs from papers to score reproducibility and beats a couple of baselines at ~61% accuracy on ReScience C and related sets, but the graphs themselves lack direct human validation against actual dependencies.

The paper's core move is to treat reproducibility assessment as agent-driven extraction of a directed workflow graph from a scientific document, followed by structural and content scores on that graph. They run this on 213 ReScience C articles plus ReproBench and GoldStandardDB, report consistent results across LLMs and temperatures, and show higher accuracy than the cited baselines on two of the sets. Code and data are released, which helps anyone who wants to inspect or extend the pipeline.

Referee Report

2 major / 2 minor

Summary. The paper introduces Agentic Reproducibility Assessment (ARA), an LLM-based agentic system that extracts a directed workflow graph from a scientific paper (linking sources, methods, experiments, and outputs) and then computes structural and content-based scores to assess reproducibility. It evaluates the approach on 213 ReScience C articles (the largest such human-validated benchmark cited), plus ReproBench and GoldStandardDB, reporting ~61% accuracy overall and the highest accuracy on the latter two benchmarks (60.71% and 61.68%) compared to prior baselines (36.84% and 43.56%). The work emphasizes generalizability across LLMs, temperatures, and domains, with code and data released.

Significance. If the extracted workflow graphs accurately reflect experimental dependencies and data flows, ARA could provide a scalable complement to human peer review for computational reproducibility assessment. The use of a large cross-domain benchmark of 213 human-validated studies, systematic testing across models and temperatures, and public release of code/data are clear strengths that support reproducibility of the reported results.

major comments (2)

[Experiments on 213 ReScience C articles and benchmark evaluations] The accuracy figures (e.g., on the 213 ReScience C articles, ReproBench, and GoldStandardDB) are computed against human-validated final reproducibility outcomes rather than against direct human annotations of the extracted directed workflow graphs themselves. Without separate validation of node/edge fidelity, missing implicit steps, or hallucinated links, it remains possible that structural and content-based scores are derived from incomplete or spurious graphs that happen to correlate with the outcome labels.
[Abstract and evaluation methodology description] The central claim of 'consistent workflow reconstruction and assessment' across LLMs, temperatures, and domains rests on the assumption that the agentic extraction faithfully captures data flows and result-generating procedures, yet the manuscript provides no error analysis or human study quantifying extraction accuracy independent of the downstream reproducibility label.
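The graph-level validation the referee asks for could take the form of edge precision and recall against a human-annotated reference graph. The sketch below is one possible shape for that check, not something the paper reports. It assumes the extracted and reference graphs already share node identifiers (aligning nodes is itself a hard problem that is skipped here), and all node names are hypothetical.

```python
def edge_fidelity(extracted, reference):
    """Compare extracted edges with human-annotated edges over shared node ids."""
    extracted, reference = set(extracted), set(reference)
    true_positive = len(extracted & reference)
    precision = true_positive / len(extracted) if extracted else 0.0
    recall = true_positive / len(reference) if reference else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Edges the agent produced that the annotator did not (possible hallucinations).
        "hallucinated": sorted(extracted - reference),
        # Edges the annotator found that the agent missed (possible implicit steps).
        "missing": sorted(reference - extracted),
    }


# Hypothetical annotation of one paper.
reference_edges = {
    ("dataset", "preprocessing"),
    ("preprocessing", "training_run"),
    ("training_run", "table_2"),
}
extracted_edges = {
    ("dataset", "preprocessing"),
    ("preprocessing", "training_run"),
    ("dataset", "figure_3"),
}

report = edge_fidelity(extracted_edges, reference_edges)
print(report)
```

Reporting the hallucinated and missing edges separately matches the two failure modes named in the first major comment, and keeps extraction quality distinct from the downstream reproducibility label.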

minor comments (2)

[Abstract] The abstract states 'highest accuracy reported' on ReproBench and GoldStandardDB; clarify whether the baseline numbers (36.84%, 43.56%) come from identical evaluation conditions or from the original papers.
[Methods] Provide more explicit description of the prompt templates, agent orchestration steps, and exact definitions of the structural versus content-based scores to aid replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, acknowledging the validity of the concerns regarding evaluation methodology. Revisions will be made to clarify the scope of our claims and add supporting analysis.

Referee: [Experiments on 213 ReScience C articles and benchmark evaluations] The accuracy figures (e.g., on the 213 ReScience C articles, ReproBench, and GoldStandardDB) are computed against human-validated final reproducibility outcomes rather than against direct human annotations of the extracted directed workflow graphs themselves. Without separate validation of node/edge fidelity, missing implicit steps, or hallucinated links, it remains possible that structural and content-based scores are derived from incomplete or spurious graphs that happen to correlate with the outcome labels.

Authors: We appreciate this observation and agree that our reported accuracies reflect end-to-end performance against final human-validated reproducibility labels rather than isolated human annotations of graph nodes, edges, or fidelity. This approach was chosen because it directly tests the system's utility for scalable reproducibility assessment in peer review. However, we acknowledge the limitation that correlation with outcomes does not fully prove graph quality. In the revised manuscript, we will add a dedicated subsection with qualitative error analysis on a sample of 20 papers, discussing issues like missing implicit steps and potential hallucinations, with examples. We will also update relevant sections to clarify this distinction.

Revision: yes

Referee: [Abstract and evaluation methodology description] The central claim of 'consistent workflow reconstruction and assessment' across LLMs, temperatures, and domains rests on the assumption that the agentic extraction faithfully captures data flows and result-generating procedures, yet the manuscript provides no error analysis or human study quantifying extraction accuracy independent of the downstream reproducibility label.

Authors: We agree that the manuscript lacks a separate human study or quantitative error analysis measuring extraction accuracy independently from the reproducibility prediction task. The consistency claims are currently supported by stable end-to-end accuracy across configurations. To address this, we will revise the abstract and methodology sections to qualify the claims as referring to consistent end-to-end assessment, and incorporate the qualitative graph analysis described in the response to the first comment. These changes will better align the presentation with the evaluation performed.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external human-validated benchmarks

The paper evaluates ARA by extracting directed workflow graphs from papers and computing structural/content-based reproducibility scores, then reports accuracy against independent external benchmarks (ReScience C with 213 human-validated articles, ReproBench, GoldStandardDB). These benchmarks supply outcome labels separate from the extraction process itself. No equations, definitions, or self-citations are shown to reduce the central performance claims to fitted inputs or prior author work by construction. The derivation chain therefore remains self-contained against external references rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only: the method rests on the assumption that LLMs can extract accurate workflow graphs representing scientific procedures.

axioms (1)

domain assumption: LLMs can reliably extract directed workflow graphs linking sources, methods, experiments, and outputs from scientific documents
This is the core mechanism described for formalizing reproducibility assessment.

pith-pipeline@v0.9.0 · 5778 in / 1120 out tokens · 37012 ms · 2026-05-19T17:09:47.577891+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: R = √(Rc · Rs) ... geometric mean ... fixed weights (w_sources, w_process, w_sinks) = (0.30,0.40,0.30)
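The quoted passage is elided, so how the fixed weights enter the two factors is not visible here. As a hedged illustration only, the sketch below assumes that Rc is a content-based score and Rs a structural score, and that each is a weighted average over three node groups (sources, process, sinks) before the geometric mean is taken. The function names and the example inputs are hypothetical.

```python
from math import sqrt

# Fixed weights as quoted in the passage above.
WEIGHTS = {"sources": 0.30, "process": 0.40, "sinks": 0.30}


def weighted_group_score(group_scores):
    """Assumed aggregation: weighted average over the three node groups."""
    return sum(WEIGHTS[group] * group_scores[group] for group in WEIGHTS)


def overall_score(content, structural):
    """R = sqrt(Rc * Rs), the geometric mean quoted from the paper."""
    rc = weighted_group_score(content)
    rs = weighted_group_score(structural)
    return sqrt(rc * rs)


# Hypothetical per-group scores in [0, 1] for one paper.
content_scores = {"sources": 1.0, "process": 0.5, "sinks": 1.0}      # Rc = 0.8
structural_scores = {"sources": 0.5, "process": 0.5, "sinks": 0.5}   # Rs = 0.5

print(round(overall_score(content_scores, structural_scores), 3))
```

One property of the geometric mean is worth noting: if either factor is zero, R is zero, so a strong content score cannot compensate for a graph that fails structurally, and vice versa.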

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 2 internal anchors

[1]

Publish or perish,

G. Parchomovsky, “Publish or perish,”Michigan Law Review, vol. 98, no. 4, pp. 926–952, 2000. doi: 10.2307/1290335

work page doi:10.2307/1290335 2000
[2]

Science in an exponential world,

A. Szalay and J. Gray, “Science in an exponential world,”Nature, vol. 440, no. 7083, pp. 413–414, 2006. doi: 10.1038/440413a

work page doi:10.1038/440413a 2006
[3]

Distinguishing

E. Mosca, M. H. I. Abdalla, P. Basso, M. Musumeci, and G. Groh, “Distinguishing fact from fiction: A benchmark dataset for identifying machine-generated scientific papers in the LLM era.”Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pp. 190–207, 2023. doi: 10.18653/v1/2023.trustnlp-1.17

work page doi:10.18653/v1/2023.trustnlp-1.17 2023
[4]

Have AI-generated texts from LLM infiltrated the realm of scientific writing? a large-scale analysis of preprint platforms,

H.-Z. Cheng, B. Sheng, A. Lee, V . Chaudhary, A. G. Atanasov, N. Liu, Y . Qiu, T. Y . Wong, Y .-C. Tham, and Y .-F. Zheng, “Have AI-generated texts from LLM infiltrated the realm of scientific writing? a large-scale analysis of preprint platforms,”bioRxiv, pp. 2024–03, 2024. doi: 10.1101/2024.03.25.586710

work page doi:10.1101/2024.03.25.586710 2024
[5]

Is LLM a reliable reviewer? a comprehensive evaluation of LLM on automatic paper reviewing tasks,

R. Zhou, L. Chen, and K. Yu, “Is LLM a reliable reviewer? a comprehensive evaluation of LLM on automatic paper reviewing tasks,”Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 9340–9351, 2024. [Online]. Available: https://aclanthology.org/2024.lrec-main.816/

work page 2024
[6]

Is peer review in decline?

G. Ellison, “Is peer review in decline?”Economic Inquiry, vol. 49, no. 3, pp. 635–657, 2011. doi: 10.1111/j.1465-7295.2010.00261.x

work page doi:10.1111/j.1465-7295.2010.00261.x 2011
[7]

The AI impera- tive: Scaling high-quality peer review in machine learning,

Q. Wei, S. Holt, J. Yang, M. Wulfmeier, and M. van der Schaar, “The AI impera- tive: Scaling high-quality peer review in machine learning,”arXiv Preprints, 2025. doi: 10.48550/arXiv.2506.08134

work page doi:10.48550/arxiv.2506.08134 2025
[8]

Popper,Logik der Forschung

K. Popper,Logik der Forschung. Vienna, Austria: Julius Springer Verlag GmbH, 1935. doi: 10.1007/978-3-7091-4177-9

work page doi:10.1007/978-3-7091-4177-9 1935
[9]

London, UK: Hutchinson & Co., 1959

——,The Logic of Scientific Discovery. London, UK: Hutchinson & Co., 1959. doi: 10.2307/2412687

work page doi:10.2307/2412687 1959
[10]

The replicability crisis and public trust in psychological sci- ence,

F. Anvari and D. Lakens, “The replicability crisis and public trust in psychological sci- ence,”Comprehensive Results in Social Psychology, vol. 3, no. 3, pp. 266–286, 2018. doi: 10.1080/23743603.2019.1684822

work page doi:10.1080/23743603.2019.1684822 2018
[11]

An open investi- gation of the reproducibility of cancer biology research,

T. M. Errington, E. Iorns, W. Gunn, F. E. Tan, J. Lomax, and B. A. Nosek, “An open investi- gation of the reproducibility of cancer biology research,”Elife, vol. 3, p. e04333, 2014. doi: 10.7554/eLife.04333

work page doi:10.7554/elife.04333 2014
[12]

The reproducibility crisis in the age of digital medicine,

A. Stupple, D. Singerman, and L. A. Celi, “The reproducibility crisis in the age of digital medicine,”NPJ digital medicine, vol. 2, no. 1, p. 2, 2019. doi: 10.1038/s41746-019-0079-z

work page doi:10.1038/s41746-019-0079-z 2019
[13]

No raw data, no science: another possible source of the reproducibility crisis,

T. Miyakawa, “No raw data, no science: another possible source of the reproducibility crisis,” Molecular brain, vol. 13, no. 1, p. 24, 2020. doi: 10.1186/s13041-020-0552-2

work page doi:10.1186/s13041-020-0552-2 2020
[14]

Repro- ducibility in management science,

M. Fišar, B. Greiner, C. Huber, E. Katok, A. I. Ozkes, and M. S. R. Collaboration, “Repro- ducibility in management science,”Management Science, vol. 70, no. 3, pp. 1343–1356, 2024. doi: 10.1287/mnsc.2023.03556

work page doi:10.1287/mnsc.2023.03556 2024
[15]

Investigating the replicability of the social and behavioural sciences,

A. H. Tyner, A. L. Abatayo, M. Daley, S. Field, N. Fox, N. A. Haber, K. M. Hahn, M. K. Struhl, B. Mawhinney, O. Miskeet al., “Investigating the replicability of the social and behavioural sciences,”Nature, vol. 652, no. 8108, pp. 143–150, 2026. doi: 10.1038/s41586-025-10078-y

work page doi:10.1038/s41586-025-10078-y 2026
[16]

Artificial intelligence faces reproducibility crisis,

M. Hutson, “Artificial intelligence faces reproducibility crisis,”Science, vol. 359, pp. 725–726,

work page
[17]

doi: 10.1126/science.359.6377.725

work page doi:10.1126/science.359.6377.725
[18]

Revisiting reproducibility in transportation simulation studies,

K. Riehl, A. Kouvelas, and M. A. Makridis, “Revisiting reproducibility in transportation simulation studies,”European Transport Research Review, vol. 17, no. 1, p. 22, 2025. doi: 10.1186/s12544-025-00718-9 . 12

work page doi:10.1186/s12544-025-00718-9 2025
[19]

Reproducibility crisis,

M. Baker, “Reproducibility crisis,”nature, vol. 533, no. 26, pp. 353–66, 2016. doi: 10.1038/533437a

work page doi:10.1038/533437a 2016
[20]

Is science really facing a reproducibility crisis, and do we need it to?

D. Fanelli, “Is science really facing a reproducibility crisis, and do we need it to?”Pro- ceedings of the National Academy of Sciences, vol. 115, no. 11, pp. 2628–2631, 2018. doi: 10.1073/pnas.1708272114

work page doi:10.1073/pnas.1708272114 2018
[21]

Before reproducibility must come preproducibility

P. B. Stark, “Before reproducibility must come preproducibility.”Nature, vol. 557, no. 7706, pp. 613–614, 2018. doi: 10.1038/d41586-018-05256-0

work page doi:10.1038/d41586-018-05256-0 2018
[22]

National Academies Press, 2019

National Academies of Sciences,Reproducibility and replicability in science. National Academies Press, 2019. doi: 10.17226/25303

work page doi:10.17226/25303 2019
[23]

Agentic AI: Autonomous Intelligence for Complex Goals—A Comprehensive Survey,

D. B. Acharya, K. Kuppan, and B. Divya, “Agentic AI: Autonomous intelligence for com- plex goals—a comprehensive survey,”IEEE Access, vol. 13, pp. 18 912–18 936, 2025. doi: 10.1109/ACCESS.2025.3532853

work page doi:10.1109/access.2025.3532853 2025
[24]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y . Jiang, C. Chen, T. Wuet al., “From generation to judgment: Opportunities and challenges of LLM-as-a-judge,” Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 2757–2791, 2025. doi: 10.18653/v1/2025.emnlp-main.138

work page doi:10.18653/v1/2025.emnlp-main.138 2025
[25]

AgentReview: Exploring Peer Review Dynamics with

Y . Jin, Q. Zhao, Y . Wang, H. Chen, K. Zhu, Y . Xiao, and J. Wang, “Agentreview: Exploring peer review dynamics with LLM agents,”Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 1208–1226, 2024. doi: 10.18653/v1/2024.emnlp-main.70

work page doi:10.18653/v1/2024.emnlp-main.70 2024
[26]

Can LLM feedback enhance review quality? a randomized study of 20k reviews at ICLR 2025,

N. Thakkar, M. Yuksekgonul, J. Silberg, A. Garg, N. Peng, F. Sha, R. Yu, C. V ondrick, and J. Zou, “Can LLM feedback enhance review quality? a randomized study of 20k reviews at ICLR 2025,”arXiv Preprints, 2025. doi: 10.48550/arXiv.2504.09737

work page doi:10.48550/arxiv.2504.09737 2025
[27]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis,

W. Liang, Y . Zhang, H. Cao, B. Wang, D. Y . Ding, X. Yang, K. V odrahalli, S. He, D. S. Smith, Y . Yinet al., “Can large language models provide useful feedback on research papers? a large-scale empirical analysis,”NEJM AI, vol. 1, no. 8, 2024. doi: 10.1056/AIoa2400196

work page doi:10.1056/aioa2400196 2024
[28]

LLMs as meta-reviewers’ assistants: A case study,

E. Hossain, S. K. Sinha, N. Bansal, R. A. Knipper, S. Sarkar, J. Salvador, Y . Mahajan, S. R. P. K. Guttikonda, M. Akter, M. M. Hassanet al., “LLMs as meta-reviewers’ assistants: A case study,” Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: ...

work page doi:10.18653/v1/2025.naacl-long.395 2025
[29]

Large language models for automated scholarly paper review: A survey,

Z. Zhuang, J. Chen, H. Xu, Y . Jiang, and J. Lin, “Large language models for automated scholarly paper review: A survey,”Information Fusion, vol. 124, p. 103332, 2025. doi: 10.1016/j.inffus.2025.103332

work page doi:10.1016/j.inffus.2025.103332 2025
[30]

Repro-bench: Can agentic ai systems assess the reproducibility of social science research?

C. Hu, L. Zhang, Y . Lim, A. Wadhwani, A. Peters, and D. Kang, “Repro-bench: Can agentic ai systems assess the reproducibility of social science research?”Findings of the Association for Computational Linguistics: ACL 2025, pp. 23 616–23 626, 2025. doi: 10.18653/v1/2025.findings-acl.1210

work page doi:10.18653/v1/2025.findings-acl.1210 2025
[31]

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

B. Nguyen, D. Soós, Q. Ma, R. R. Obadage, Z. Ranjan, S. Koneru, T. M. Errington, S. Nematova, S. Rajtmajer, J. Wuet al., “Replicatorbench: Benchmarking LLM agents for replicability in social and behavioral sciences,”arXiv Preprints, 2026. doi: 10.48550/arXiv.2602.11354

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.11354 2026
[32]

Airs-bench: a suite of tasks for frontier ai research science agents.arXiv preprint arXiv:2602.06855, 2026

A. Lupidi, B. Gauri, T. S. Foster, B. A. Omari, D. Magka, A. Pepe, A. Audran-Reiss, M. Aghamelu, N. Baldwin, L. Cipolina-Kunet al., “AIRS-Bench: a suite of tasks for frontier AI research science agents,”arXiv Preprints, 2026. doi: 10.48550/arXiv.2602.06855

work page doi:10.48550/arxiv.2602.06855 2026
[33]

Gyeongwon James Kim, Alex Wilf, Louis philippe Morency, and Daniel Fried

G. J. Kim, A. Wilf, L.-P. Morency, and D. Fried, “From reproduction to replication: Evaluating research agents with progressive code masking,”arXiv Preprints, 2025. doi: 10.48550/arXiv.2506.19724

work page doi:10.48550/arxiv.2506.19724 2025
[34]

Zachary S Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan

M. Seo, J. Baek, S. Lee, and S. J. Hwang, “Paper2code: Automating code generation from scientific papers in machine learning,”arXiv Preprints, 2025. doi: 10.48550/arXiv.2504.17192 . 13

work page doi:10.48550/arxiv.2504.17192 2025
[35]

Rescience c: a journal for reproducible replications in computa- tional science,

N. P. Rougier and K. Hinsen, “Rescience c: a journal for reproducible replications in computa- tional science,”International Workshop on Reproducible Research in Pattern Recognition, pp. 150–156, 2018. doi: 10.1007/978-3-030-23987-9_14

work page doi:10.1007/978-3-030-23987-9_14 2018
[36]

Assessing data availability and research reproducibility in hydrology and water resources,

J. H. Stagge, D. E. Rosenberg, A. M. Abdallah, H. Akbar, N. A. Attallah, and R. James, “Assessing data availability and research reproducibility in hydrology and water resources,” Scientific Data, vol. 6, no. 1, p. 190030, 2019. doi: 10.1038/sdata.2019.30

work page doi:10.1038/sdata.2019.30 2019
[37]

Reliability: on the reproducibility of assessment data,

S. M. Downing, “Reliability: on the reproducibility of assessment data,”Medical Education, vol. 38, no. 9, pp. 1006–1012, 2004. doi: 10.1111/j.1365-2929.2004.01932.x

work page doi:10.1111/j.1365-2929.2004.01932.x 2004
[38]

Accuracy, reproducibility and repeatability of ultrasonography in the assessment of abdominal adiposity,

A. Bazzocchi, G. Filonzi, F. Ponti, C. Sassi, E. Salizzoni, G. Battista, and R. Canini, “Accuracy, reproducibility and repeatability of ultrasonography in the assessment of abdominal adiposity,” Academic Radiology, vol. 18, no. 9, pp. 1133–1143, 2011. doi: 10.1016/j.acra.2011.04.014

work page doi:10.1016/j.acra.2011.04.014 2011
[39]

Critical review of current approaches for echocardiographic reproducibility and reliability assessment in clinical research,

A. L. Crowley, E. Yow, H. X. Barnhart, M. A. Daubert, R. Bigelow, D. C. Sullivan, M. Pencina, and P. S. Douglas, “Critical review of current approaches for echocardiographic reproducibility and reliability assessment in clinical research,”Journal of the American Society of Echocardiog- raphy, vol. 29, no. 12, pp. 1144–1154, 2016. doi: 10.1016/j.echo.2016.08.006

work page doi:10.1016/j.echo.2016.08.006 2016
[40]

A practical guide to assess the reproducibility of echocardiographic measurements,

K. V . Bunting, R. P. Steeds, K. Slater, J. K. Rogers, G. V . Gkoutos, and D. Kotecha, “A practical guide to assess the reproducibility of echocardiographic measurements,”Journal of the American Society of Echocardiography, vol. 32, no. 12, pp. 1505–1515, 2019. doi: 10.1016/j.echo.2019.08.015

work page doi:10.1016/j.echo.2019.08.015 2019
[41]

Statistical methods for replicability assessment,

K. Hung and W. Fithian, “Statistical methods for replicability assessment,”The Annals of Applied Statistics, vol. 14, no. 3, pp. 1063–1087, 2020. doi: 10.1214/20-AOAS1336

work page doi:10.1214/20-aoas1336 2020
[42]

The assessment of replicability using the sum of p-values,

L. Held, S. Pawel, and C. Micheloud, “The assessment of replicability using the sum of p-values,” Royal Society Open Science, vol. 11, no. 8, 2024. doi: 10.1098/rsos.240149

work page doi:10.1098/rsos.240149 2024
[43]

Systematic assessment of the replicability and generalizability of preclinical findings: Impact of protocol harmonization across laboratory sites,

M. Arroyo-Araujo, B. V oelkl, C. Laloux, J. Novak, B. Koopmans, A.-M. Waldron, I. Seiffert, H. Stirling, K. Aulehner, S. K. Janhunenet al., “Systematic assessment of the replicability and generalizability of preclinical findings: Impact of protocol harmonization across laboratory sites,”PLoS biology, vol. 20, no. 11, p. e3001886, 2022. doi: 10.1371/journa...

work page doi:10.1371/journal.pbio.3001886 2022
[44]

Deepcode: Open agentic coding,

Z. Li, Z. Li, Z. Guo, X. Ren, and C. Huang, “Deepcode: Open agentic coding,”arXiv Preprints,

work page
[45]

doi: 10.48550/arXiv.2512.07921

work page doi:10.48550/arxiv.2512.07921
[46]

Replicationbench: Can AI agents replicate astrophysics research papers?

C. Ye, S. Yuan, S. Cooray, S. Dillmann, I. L. Roque, D. Baron, P. Frank, S. Martin-Alvarez, N. Koblischke, F. J. Quet al., “Replicationbench: Can AI agents replicate astrophysics research papers?”arXiv Preprints, 2025. doi: 10.48550/arXiv.2510.24591

work page doi:10.48550/arxiv.2510.24591 2025
[47]

PaperBench: Evaluating AI's Ability to Replicate AI Research

G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompsonet al., “Paperbench: Evaluating AI’s ability to replicate AI research,”arXiv Preprints, 2025. doi: 10.48550/arXiv.2504.01848

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.01848 2025
[48]

LLM-assisted replication for quantitative social science,

S. Kubota, H. Yakura, S. Coavoux, S. Yamada, and Y . Nakamura, “LLM-assisted replication for quantitative social science,”arXiv Preprints, 2026. doi: 10.48550/arXiv.2602.18453

work page doi:10.48550/arxiv.2602.18453 2026
[49]

S. Kubota, H. Yakura, S. Yamada, Y. Nakamura, T. Werner, and S. Coavoux, “LLM-assisted replication as scientific infrastructure,” Open Science Framework, 2026
[50]

K. Tyser, B. Segev, G. Longhitano, X.-Y. Zhang, Z. Meeks, J. Lee, U. Garg, N. Belsten, A. Shporer, M. Udell et al., “AI-driven review systems: evaluating LLMs in scalable and bias-aware academic reviews,” arXiv Preprints, 2024. doi: 10.48550/arXiv.2408.10365
[51]

Y. Zhang, H. Zhang, W. Ji, T. Hua, N. Haber, H. Cao, and W. Liang, “From replication to redesign: Exploring pairwise comparisons for LLM-based peer review,” arXiv Preprints, 2025. doi: 10.48550/arXiv.2506.11343
[52]

M. Naddaf, “AI is transforming peer review—and many scientists are worried,” Nature, vol. 639, no. 8056, pp. 852–854, 2025. doi: 10.1038/d41586-025-00894-7
[53]

——, “More than half of researchers now use AI for peer review—often against guidance,” Nature, vol. 649, no. 8096, pp. 273–274, 2026. doi: 10.1038/d41586-025-04066-5
[54]

A. Bhaskar and V. Stodden, “Reproscreener: Leveraging LLMs for assessing computational reproducibility of machine learning pipelines,” Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, pp. 101–109, 2024. doi: 10.1145/3641525.3663629
[55]

D. Santoli and F. Bolelli, “Paper-snitch: A practical tool for evidence-based reproducibility assessment,” Master’s thesis, University of Modena and Reggio Emilia, 2024. [Online]. Available: https://federicobolelli.it/media/supervision_pdfs/LM_Davide_Santoli.pdf
[56]

F. Da Ros, T. Začiragić, A. Plaat, T. Bäck, and N. van Stein, “Assessing reproducibility in evolutionary computation: A case study using human- and LLM-based assessment,” arXiv Preprints, 2026. doi: 10.48550/arXiv.2602.07059
[57]

J. G. de Almeida and N. Papanikolaou, “Auto-metrics: LLM-assisted scientific quality control for radiomics research,” European Journal of Radiology, p. 112358, 2025. doi: 10.1016/j.ejrad.2025.112358
[58]

A. Brodeur, D. Mikola, and N. Cook, “Mass reproducibility and replicability: A new hope,” Institute of Labor Economics (IZA), no. 16912, 2024. [Online]. Available: https://www.jstor.org/stable/pdf/resrep58994.pdf
[59]

O. E. Gundersen and S. Kjensmo, “State of the art: Reproducibility in artificial intelligence,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018. doi: 10.1609/aaai.v32i1.11503
[60]

A. Brodeur, “I4R discussion paper series, the Institute for Replication (I4R),” Institute for Replication (I4R), Leibniz-Institut für Wirtschaftsforschung, 2024. [Online]. Available: https://www.rwi-essen.de/i4r-discussion-paper-series
[61]

A. Marcus and I. Oransky, “Retraction watch – tracking retractions as a window into the scientific process,” Retraction Watch, Tech. Rep., 2024. [Online]. Available: https://retractionwatch.com/
