
arxiv: 2605.02651 · v2 · pith:SNWUXK5Xnew · submitted 2026-05-04 · 💻 cs.DL · cs.LG

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

Pith reviewed 2026-05-19 17:09 UTC · model grok-4.3

keywords reproducibility assessment · workflow graphs · agentic AI · scientific peer review · computational reproducibility · LLM agents · ReScience C benchmark

The pith

ARA extracts directed workflow graphs from papers to evaluate reproducibility at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Agentic Reproducibility Assessment (ARA), a method that casts reproducibility checking as a structured reasoning task over full scientific documents. It first extracts a directed graph connecting sources, methods, experiments, and outputs, then scores how completely that graph can be reconstructed from the text alone. Experiments on 213 human-validated ReScience C articles spanning several domains show consistent results across language models and temperature settings. The system reaches roughly 61 percent accuracy on three benchmarks and reports the highest accuracy to date on two of them, ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%). If the graphs truly track real experimental dependencies, the method offers a concrete way to supplement human reviewers when paper volume outstrips manual capacity.

Core claim

ARA formalizes reproducibility assessment as a structured reasoning task over scientific documents: it extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates how reconstructable that graph is using structural and content-based scores.

What carries the argument

The directed workflow graph that links sources, methods, experiments, and outputs, scored for reconstructability through structural and content-based metrics.
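
To make the graph-plus-scores idea concrete, here is a minimal Python sketch of a workflow graph with the four node types named in the paper's Figure 1 caption (sources, methods, experiments, sinks) and a node-level score r(node). The class name, the example nodes, and both scoring functions are assumptions made for illustration; the page does not give the paper's actual definitions of the structural and content-based scores.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Set

# The four node types named in the paper's Figure 1 caption.
NODE_TYPES = ("source", "method", "experiment", "sink")


@dataclass
class WorkflowGraph:
    """Directed workflow graph with a per-node reconstructability score r(node)."""

    node_types: Dict[str, str] = field(default_factory=dict)
    node_scores: Dict[str, float] = field(default_factory=dict)
    successors: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, node_id: str, node_type: str, score: float) -> None:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must lie in [0, 1]")
        self.node_types[node_id] = node_type
        self.node_scores[node_id] = score
        self.successors.setdefault(node_id, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.node_types or dst not in self.node_types:
            raise KeyError("both endpoints must be added as nodes first")
        self.successors[src].add(dst)

    def structural_score(self) -> float:
        """Hypothetical metric: share of sinks reachable from any source."""
        sinks = [n for n, t in self.node_types.items() if t == "sink"]
        if not sinks:
            return 0.0
        queue = deque(n for n, t in self.node_types.items() if t == "source")
        reached = set(queue)
        while queue:  # breadth-first search along directed edges
            for nxt in self.successors[queue.popleft()]:
                if nxt not in reached:
                    reached.add(nxt)
                    queue.append(nxt)
        return sum(s in reached for s in sinks) / len(sinks)

    def content_score(self) -> float:
        """Hypothetical metric: mean of the node-level scores r(node)."""
        if not self.node_scores:
            return 0.0
        return sum(self.node_scores.values()) / len(self.node_scores)


# Illustrative graph; node names and scores are invented.
graph = WorkflowGraph()
graph.add_node("dataset", "source", 0.9)
graph.add_node("model", "method", 0.6)
graph.add_node("training_run", "experiment", 0.5)
graph.add_node("accuracy_table", "sink", 0.8)
graph.add_node("ablation_plot", "sink", 0.2)  # no described path leads here
graph.add_edge("dataset", "training_run")
graph.add_edge("model", "training_run")
graph.add_edge("training_run", "accuracy_table")

print(graph.structural_score(), graph.content_score())
```

In this toy graph only one of the two sinks can be traced back to a source, so the structural placeholder flags the ablation plot as an output whose provenance the text does not explain.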

If this is right

  • Reproducibility scoring becomes consistent across different language models and temperature settings.
  • The method reports the highest accuracy to date on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%).
  • Peer review gains a scalable complement that can process hundreds of papers with uniform criteria.
  • Next-generation review systems can incorporate automated graph reconstruction as a first-pass filter.
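
The consistency claim in the first bullet can be made measurable by looking at how far a paper's score moves when the model or temperature changes. The sketch below is one possible check, not the paper's protocol: the run data, model names, and the max-minus-min spread metric are all illustrative assumptions.

```python
from statistics import mean

# Hypothetical runs: paper id -> {(model name, temperature): score in [0, 1]}.
runs = {
    "paper_a": {("model_x", 0.0): 0.62, ("model_x", 1.0): 0.60, ("model_y", 0.0): 0.64},
    "paper_b": {("model_x", 0.0): 0.30, ("model_x", 1.0): 0.35, ("model_y", 0.0): 0.25},
}


def score_spread(per_config):
    """Range of scores for one paper across all configurations."""
    values = list(per_config.values())
    return max(values) - min(values)


# Per-paper spread, plus the average spread over the whole set.
spreads = {paper: score_spread(cfg) for paper, cfg in runs.items()}
mean_spread = mean(spreads.values())

for paper, spread in spreads.items():
    print(f"{paper}: spread = {spread:.2f}")
print(f"mean spread = {mean_spread:.2f}")
```

A small mean spread would support the consistency claim; a large spread on particular papers would point to documents where the assessment depends on the configuration.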

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Review platforms could route low-scoring papers to deeper human inspection while clearing high-scoring ones faster.
  • The same graph-extraction logic might extend to non-computational fields if the scoring rules are adjusted for qualitative evidence.
  • Hybrid human-agent pipelines become feasible where agents handle routine dependency mapping and humans resolve edge cases.
  • Open release of the extracted graphs alongside papers would let later readers verify or extend the reproducibility claims directly.

Load-bearing premise

The workflow graphs the system extracts accurately reflect the paper's actual experimental dependencies, data flows, and result-generating steps.

What would settle it

Direct head-to-head comparison in which independent human experts manually reconstruct the same set of workflow graphs from the papers and measure whether the automated scores align with those human reconstructions.
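
One way to quantify the alignment described above is a rank correlation between the automated scores and scores derived from the human reconstructions. The following sketch implements Spearman's ρ with the standard library only; the two score lists are invented, and the choice of Spearman correlation is an assumption rather than a procedure stated on this page.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four papers, for illustration only.
automated_scores = [0.9, 0.4, 0.7, 0.2]
human_scores = [0.8, 0.5, 0.6, 0.1]
rho = spearman(automated_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure is used here because the two score scales need not be calibrated against each other; only the ordering of papers has to agree.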

Figures

Figures reproduced from arXiv: 2605.02651 by Anastasios Kouvelas, Andres L. Marin, Fan Wu, Georgios Fontaras, Kevin Riehl, Michail A. Makridis, Nikofors Zacharof, Patrick Langer, Robert Jakob.

Figure 1: Agentic Reproducibility Assessment (ARA) pipeline. First, a given scientific paper (or document) D is transformed into a directed workflow graph G comprising four types of nodes (sources, methods, experiments, sinks). Second, the workflow graph's reconstructability is projected onto micro-level assessments of reproducibility (node by node) r(·). Third, the micro-level assessments are aggregated to reprod…
Figure 2: Human-Agent Disagreement on Reproducibility Assessment (ReScience C).
Figure 3: Workflow Graph Generated From A Scientific Paper.
Figure 4: Score and Rank Robustness under Dirichlet Perturbation. Left: per-profile |∆R| vs default (median, p95 band) as a function of concentration α. Right: per-paper Spearman ρ vs default (min, mean band) across 200 draws per α. Both axes are log-scaled in α.
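
The score-robustness panel of Figure 4 can be illustrated with a small simulation. The sketch below assumes that the overall score R is a weighted sum of component scores and that perturbed weights are drawn from a Dirichlet distribution centred on the default weights with concentration α. Both assumptions, along with the weights and component scores, are illustrative; the caption does not specify the paper's setup beyond the 200 draws per α. Only the |∆R| part is covered, not the Spearman ρ panel.

```python
import random
from statistics import median


def sample_dirichlet(rng, concentrations):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def aggregate(weights, scores):
    """Assumed aggregation rule: weighted sum of component scores."""
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical default weights and component scores for one paper.
default_weights = [0.25, 0.25, 0.25, 0.25]
component_scores = [0.9, 0.6, 0.5, 0.8]
default_R = aggregate(default_weights, component_scores)

rng = random.Random(0)  # fixed seed so the run is repeatable
median_abs_delta = {}
for alpha in (1.0, 10.0, 100.0, 1000.0):
    # Larger alpha concentrates the draws around the default weights.
    concentrations = [alpha * w for w in default_weights]
    deltas = []
    for _ in range(200):  # 200 draws per alpha, as in the caption
        weights = sample_dirichlet(rng, concentrations)
        deltas.append(abs(aggregate(weights, component_scores) - default_R))
    median_abs_delta[alpha] = median(deltas)

for alpha, value in median_abs_delta.items():
    print(f"alpha = {alpha:g}: median |dR| = {value:.4f}")
```

As α grows the sampled weights cluster around the defaults, so the median |∆R| should shrink; how quickly it shrinks indicates how sensitive the score is to the choice of weights.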
Original abstract

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles - the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date - demonstrate ARA's generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next-generation peer review. Code and Data available: https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Agentic Reproducibility Assessment (ARA), an LLM-based agentic system that extracts a directed workflow graph from a scientific paper (linking sources, methods, experiments, and outputs) and then computes structural and content-based scores to assess reproducibility. It evaluates the approach on 213 ReScience C articles (the largest such human-validated benchmark cited), plus ReproBench and GoldStandardDB, reporting ~61% accuracy overall and the highest accuracy on the latter two benchmarks (60.71% and 61.68%) compared to prior baselines (36.84% and 43.56%). The work emphasizes generalizability across LLMs, temperatures, and domains, with code and data released.

Significance. If the extracted workflow graphs accurately reflect experimental dependencies and data flows, ARA could provide a scalable complement to human peer review for computational reproducibility assessment. The use of a large cross-domain benchmark of 213 human-validated studies, systematic testing across models and temperatures, and public release of code/data are clear strengths that support reproducibility of the reported results.

major comments (2)
  1. [Experiments on 213 ReScience C articles and benchmark evaluations] The accuracy figures (e.g., on the 213 ReScience C articles, ReproBench, and GoldStandardDB) are computed against human-validated final reproducibility outcomes rather than against direct human annotations of the extracted directed workflow graphs themselves. Without separate validation of node/edge fidelity, missing implicit steps, or hallucinated links, it remains possible that structural and content-based scores are derived from incomplete or spurious graphs that happen to correlate with the outcome labels.
  2. [Abstract and evaluation methodology description] The central claim of 'consistent workflow reconstruction and assessment' across LLMs, temperatures, and domains rests on the assumption that the agentic extraction faithfully captures data flows and result-generating procedures, yet the manuscript provides no error analysis or human study quantifying extraction accuracy independent of the downstream reproducibility label.
minor comments (2)
  1. [Abstract] The abstract states 'highest accuracy reported' on ReproBench and GoldStandardDB; clarify whether the baseline numbers (36.84%, 43.56%) come from identical evaluation conditions or from the original papers.
  2. [Methods] Provide more explicit description of the prompt templates, agent orchestration steps, and exact definitions of the structural versus content-based scores to aid replication.
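
The graph-fidelity check requested in the first major comment could take the form of edge-level precision and recall against a human-annotated graph. The sketch below shows that comparison on sets of directed edges. The edge lists are invented, and treating fidelity as exact edge matching is a simplifying assumption, since real annotations would first need node names to be aligned.

```python
def edge_fidelity(extracted, annotated):
    """Compare agent-extracted edges with human-annotated edges."""
    true_positives = extracted & annotated
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(annotated) if annotated else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucinated": extracted - annotated,  # edges the agent invented
        "missing": annotated - extracted,       # edges the agent overlooked
    }


# Invented example: each edge is a (source node, target node) pair.
annotated_edges = {
    ("dataset", "training_run"),
    ("model", "training_run"),
    ("preprocessing", "training_run"),
    ("training_run", "accuracy_table"),
}
extracted_edges = {
    ("dataset", "training_run"),
    ("model", "training_run"),
    ("training_run", "accuracy_table"),
    ("dataset", "accuracy_table"),
}

report = edge_fidelity(extracted_edges, annotated_edges)
print(report)
```

Reporting hallucinated and missing edges separately mirrors the two failure modes the referee names: spurious links and omitted implicit steps.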

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, acknowledging the validity of the concerns regarding evaluation methodology. Revisions will be made to clarify the scope of our claims and add supporting analysis.

  1. Referee: [Experiments on 213 ReScience C articles and benchmark evaluations] The accuracy figures (e.g., on the 213 ReScience C articles, ReproBench, and GoldStandardDB) are computed against human-validated final reproducibility outcomes rather than against direct human annotations of the extracted directed workflow graphs themselves. Without separate validation of node/edge fidelity, missing implicit steps, or hallucinated links, it remains possible that structural and content-based scores are derived from incomplete or spurious graphs that happen to correlate with the outcome labels.

    Authors: We appreciate this observation and agree that our reported accuracies reflect end-to-end performance against final human-validated reproducibility labels rather than isolated human annotations of graph nodes, edges, or fidelity. This approach was chosen because it directly tests the system's utility for scalable reproducibility assessment in peer review. However, we acknowledge the limitation that correlation with outcomes does not fully prove graph quality. In the revised manuscript, we will add a dedicated subsection with qualitative error analysis on a sample of 20 papers, discussing issues like missing implicit steps and potential hallucinations, with examples. We will also update relevant sections to clarify this distinction. revision: yes

  2. Referee: [Abstract and evaluation methodology description] The central claim of 'consistent workflow reconstruction and assessment' across LLMs, temperatures, and domains rests on the assumption that the agentic extraction faithfully captures data flows and result-generating procedures, yet the manuscript provides no error analysis or human study quantifying extraction accuracy independent of the downstream reproducibility label.

    Authors: We agree that the manuscript lacks a separate human study or quantitative error analysis measuring extraction accuracy independently from the reproducibility prediction task. The consistency claims are currently supported by stable end-to-end accuracy across configurations. To address this, we will revise the abstract and methodology sections to qualify the claims as referring to consistent end-to-end assessment, and incorporate the qualitative graph analysis described in the response to the first comment. These changes will better align the presentation with the evaluation performed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external human-validated benchmarks


The paper evaluates ARA by extracting directed workflow graphs from papers and computing structural and content-based reproducibility scores, then reports accuracy against independent external benchmarks (ReScience C with 213 human-validated articles, ReproBench, GoldStandardDB). These benchmarks supply outcome labels that are separate from the extraction process itself. No equations, definitions, or self-citations are shown to reduce the central performance claims to fitted inputs or prior author work by construction. The evaluation is therefore anchored to external references rather than being tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only: the method rests on the assumption that LLMs can extract accurate workflow graphs representing scientific procedures.

axioms (1)
  • domain assumption: LLMs can reliably extract directed workflow graphs linking sources, methods, experiments, and outputs from scientific documents.
    This is the core mechanism described for formalizing reproducibility assessment.

pith-pipeline@v0.9.0 · 5778 in / 1120 out tokens · 37012 ms · 2026-05-19T17:09:47.577891+00:00 · methodology


