ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
Pith reviewed 2026-05-19 17:09 UTC · model grok-4.3
The pith
ARA extracts directed workflow graphs from papers to evaluate reproducibility at scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARA formalizes reproducibility assessment as a structured reasoning task over scientific documents: it extracts a directed workflow graph linking sources, methods, experiments, and outputs, then scores how well that graph can be reconstructed using structural and content-based metrics.
What carries the argument
The directed workflow graph that links sources, methods, experiments, and outputs, scored for reconstructability through structural and content-based metrics.
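To make the data structure concrete, here is a minimal sketch of a directed workflow graph with the four node kinds named above. The class name, method names, and the reachability check are assumptions for illustration only; the check is a crude stand-in for "can this output be traced back to a source" and is not ARA's actual structural or content-based score.

```python
from collections import deque
from dataclasses import dataclass, field

# Node kinds come from the paper's description; everything else is illustrative.
KINDS = {"source", "method", "experiment", "output"}


@dataclass
class WorkflowGraph:
    kinds: dict[str, str] = field(default_factory=dict)       # node name -> kind
    edges: dict[str, set[str]] = field(default_factory=dict)  # node name -> successors

    def add_node(self, name: str, kind: str) -> None:
        if kind not in KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind
        self.edges.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].add(dst)

    def reachable_from_sources(self) -> set[str]:
        """Breadth-first search starting from every source node."""
        start = [n for n, k in self.kinds.items() if k == "source"]
        seen = set(start)
        queue = deque(start)
        while queue:
            node = queue.popleft()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def untraceable_outputs(self) -> set[str]:
        """Outputs that no source reaches (hypothetical proxy, not ARA's metric)."""
        reached = self.reachable_from_sources()
        return {n for n, k in self.kinds.items() if k == "output" and n not in reached}


g = WorkflowGraph()
g.add_node("dataset", "source")
g.add_node("training", "method")
g.add_node("ablation", "experiment")
g.add_node("table_1", "output")
g.add_node("figure_2", "output")
g.add_edge("dataset", "training")
g.add_edge("training", "ablation")
g.add_edge("ablation", "table_1")
print(g.untraceable_outputs())  # figure_2 has no path from any source
```

In this toy graph, `table_1` can be traced back to `dataset`, while `figure_2` cannot, which is the kind of gap a reconstructability score would presumably penalize.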
If this is right
- Reproducibility scoring becomes consistent across different language models and temperature settings.
- The method records the highest accuracy reported so far on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%).
- Peer review gains a scalable complement that can process hundreds of papers with uniform criteria.
- Next-generation review systems can incorporate automated graph reconstruction as a first-pass filter.
Where Pith is reading between the lines
- Review platforms could route low-scoring papers to deeper human inspection while clearing high-scoring ones faster.
- The same graph-extraction logic might extend to non-computational fields if the scoring rules are adjusted for qualitative evidence.
- Hybrid human-agent pipelines become feasible where agents handle routine dependency mapping and humans resolve edge cases.
- Open release of the extracted graphs alongside papers would let later readers verify or extend the reproducibility claims directly.
Load-bearing premise
The workflow graphs the system extracts accurately reflect the paper's actual experimental dependencies, data flows, and result-generating steps.
What would settle it
A direct head-to-head comparison in which independent human experts manually reconstruct workflow graphs for the same papers, followed by a measurement of how closely the automatically extracted graphs and their scores agree with those human reconstructions.
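One way such a comparison could be quantified is edge-level precision and recall between an extracted graph and a human reference graph. The sketch below assumes that node names have already been aligned between the two graphs, which is itself a nontrivial step; the function name and the example edges are hypothetical and do not come from the paper.

```python
def edge_agreement(
    extracted: set[tuple[str, str]],
    reference: set[tuple[str, str]],
) -> dict[str, float]:
    """Precision, recall, and F1 of extracted edges against a human reference."""
    true_pos = len(extracted & reference)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical graphs for one paper, written as (from, to) edge pairs.
human = {("dataset", "training"), ("training", "evaluation"), ("evaluation", "table_1")}
auto = {("dataset", "training"), ("training", "evaluation"), ("training", "table_1")}

scores = edge_agreement(auto, human)
print(scores)
```

Here the extracted graph contains one edge the human did not draw (a candidate hallucinated link) and misses one the human did (a candidate missing step), so precision and recall both come out at two thirds.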
Original abstract
Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles - the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date - demonstrate ARA's generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next-generation peer review. Code and Data available: https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.
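As a quick arithmetic check on the headline comparison, the sketch below computes the absolute and relative gains implied by the accuracy figures quoted in the abstract. It assumes the baseline numbers were obtained under comparable evaluation conditions, which the abstract does not state; the function name is illustrative.

```python
# Accuracy figures (percent) as quoted in the abstract: (ARA, prior best).
reported = {
    "ReproBench": (60.71, 36.84),
    "GoldStandardDB": (61.68, 43.56),
}


def gains(ara: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = ara - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


for name, (ara, baseline) in reported.items():
    points, rel = gains(ara, baseline)
    print(f"{name}: +{points} points, +{rel}% relative")
```

The absolute gains work out to about 23.87 points on ReproBench and 18.12 points on GoldStandardDB, while both ARA accuracies stay close to 61%.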
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Reproducibility Assessment (ARA), an LLM-based agentic system that extracts a directed workflow graph from a scientific paper (linking sources, methods, experiments, and outputs) and then computes structural and content-based scores to assess reproducibility. It evaluates the approach on 213 ReScience C articles (the largest such human-validated benchmark cited), plus ReproBench and GoldStandardDB, reporting ~61% accuracy overall and the highest accuracy on the latter two benchmarks (60.71% and 61.68%) compared to prior baselines (36.84% and 43.56%). The work emphasizes generalizability across LLMs, temperatures, and domains, with code and data released.
Significance. If the extracted workflow graphs accurately reflect experimental dependencies and data flows, ARA could provide a scalable complement to human peer review for computational reproducibility assessment. The use of a large cross-domain benchmark of 213 human-validated studies, systematic testing across models and temperatures, and public release of code/data are clear strengths that support reproducibility of the reported results.
major comments (2)
- [Experiments on 213 ReScience C articles and benchmark evaluations] The accuracy figures (e.g., on the 213 ReScience C articles, ReproBench, and GoldStandardDB) are computed against human-validated final reproducibility outcomes rather than against direct human annotations of the extracted directed workflow graphs themselves. Without separate validation of node/edge fidelity, missing implicit steps, or hallucinated links, it remains possible that structural and content-based scores are derived from incomplete or spurious graphs that happen to correlate with the outcome labels.
- [Abstract and evaluation methodology description] The central claim of 'consistent workflow reconstruction and assessment' across LLMs, temperatures, and domains rests on the assumption that the agentic extraction faithfully captures data flows and result-generating procedures, yet the manuscript provides no error analysis or human study quantifying extraction accuracy independent of the downstream reproducibility label.
minor comments (2)
- [Abstract] The abstract states 'highest accuracy reported' on ReproBench and GoldStandardDB; clarify whether the baseline numbers (36.84%, 43.56%) come from identical evaluation conditions or from the original papers.
- [Methods] Provide more explicit description of the prompt templates, agent orchestration steps, and exact definitions of the structural versus content-based scores to aid replication.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, acknowledging the validity of the concerns regarding evaluation methodology. Revisions will be made to clarify the scope of our claims and add supporting analysis.
Point-by-point responses
- Referee: [Experiments on 213 ReScience C articles and benchmark evaluations] The accuracy figures (e.g., on the 213 ReScience C articles, ReproBench, and GoldStandardDB) are computed against human-validated final reproducibility outcomes rather than against direct human annotations of the extracted directed workflow graphs themselves. Without separate validation of node/edge fidelity, missing implicit steps, or hallucinated links, it remains possible that structural and content-based scores are derived from incomplete or spurious graphs that happen to correlate with the outcome labels.
  Authors: We appreciate this observation and agree that our reported accuracies reflect end-to-end performance against final human-validated reproducibility labels rather than isolated human annotations of graph nodes, edges, or fidelity. This approach was chosen because it directly tests the system's utility for scalable reproducibility assessment in peer review. However, we acknowledge the limitation that correlation with outcomes does not fully prove graph quality. In the revised manuscript, we will add a dedicated subsection with qualitative error analysis on a sample of 20 papers, discussing issues like missing implicit steps and potential hallucinations, with examples. We will also update relevant sections to clarify this distinction.
  Revision: yes
- Referee: [Abstract and evaluation methodology description] The central claim of 'consistent workflow reconstruction and assessment' across LLMs, temperatures, and domains rests on the assumption that the agentic extraction faithfully captures data flows and result-generating procedures, yet the manuscript provides no error analysis or human study quantifying extraction accuracy independent of the downstream reproducibility label.
  Authors: We agree that the manuscript lacks a separate human study or quantitative error analysis measuring extraction accuracy independently from the reproducibility prediction task. The consistency claims are currently supported by stable end-to-end accuracy across configurations. To address this, we will revise the abstract and methodology sections to qualify the claims as referring to consistent end-to-end assessment, and incorporate the qualitative graph analysis described in the response to the first comment. These changes will better align the presentation with the evaluation performed.
  Revision: yes
Circularity Check
No significant circularity; claims rest on external human-validated benchmarks
The paper evaluates ARA by extracting directed workflow graphs from papers and computing structural/content-based reproducibility scores, then reports accuracy against independent external benchmarks (ReScience C with 213 human-validated articles, ReproBench, GoldStandardDB). These benchmarks supply outcome labels separate from the extraction process itself. No equations, definitions, or self-citations are shown to reduce the central performance claims to fitted inputs or prior author work by construction. The derivation chain therefore remains self-contained against external references rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can reliably extract directed workflow graphs linking sources, methods, experiments, and outputs from scientific documents.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores
- IndisputableMonolith/Foundation/BranchSelection.lean, theorem branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: R = √(Rc · Rs) ... geometric mean ... fixed weights (w_sources, w_process, w_sinks) = (0.30, 0.40, 0.30)
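The quoted passage combines two scores through a geometric mean and mentions fixed weights, but the elided parts leave open exactly what the weights apply to. The sketch below is therefore an illustration under explicit assumptions: that Rc and Rs are component scores in [0, 1], and that the weights form a weighted sum over per-part scores for sources, process, and sinks. Only the formula R = √(Rc · Rs) and the weight values come from the passage; the function names and the example inputs are hypothetical.

```python
import math

# Weight values quoted in the passage; how they enter the score is an assumption.
WEIGHTS = {"sources": 0.30, "process": 0.40, "sinks": 0.30}


def weighted_component(parts: dict[str, float]) -> float:
    """Weighted sum of per-part scores in [0, 1] (hypothetical use of the weights)."""
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


def combined_score(r_c: float, r_s: float) -> float:
    """R = sqrt(Rc * Rs), the geometric mean quoted from the paper."""
    if not (0.0 <= r_c <= 1.0 and 0.0 <= r_s <= 1.0):
        raise ValueError("scores are assumed to lie in [0, 1]")
    return math.sqrt(r_c * r_s)


# Made-up inputs: 0.30 * 1.0 + 0.40 * 0.5 + 0.30 * 0.5 = 0.65
r_c = weighted_component({"sources": 1.0, "process": 0.5, "sinks": 0.5})
r_s = 0.9
print(round(r_c, 2), round(combined_score(r_c, r_s), 3))
```

A geometric mean has the property that a zero in either component drives the combined score to zero, so under this reading a paper could not compensate for a completely failed component with a strong result on the other.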
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.