AI Research moves towards open and reproducible science

Holger Hoos; Kevin L Coakley; Odd Erik Gundersen; Thijs Snelleman

arxiv: 2606.16974 · v2 · pith:AD5DETZHnew · submitted 2026-06-15 · 💻 cs.AI

Pith reviewed 2026-06-27 03:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords reproducibility, documentation practices, AI conferences, open science, code sharing, data sharing, empirical analysis

The pith

Documentation practices in top AI conferences improved markedly from 2014 to 2024, with papers sharing both code and data rising from 11 percent to 64 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines all publications from five leading AI conferences across a full decade, tracking seven documentation variables in 56,800 papers. It reports a nearly sixfold increase in joint code and data sharing and, by applying empirical reproducibility rates from a prior study, estimates that reproducibility rose from 28 percent to 64 percent. These gains began before the venues introduced reproducibility checklists. This matters because higher documentation rates make published claims easier to verify and build upon, in a field where many results have historically been hard to reproduce.

Core claim

In the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64%. Building on empirical reproducibility rates from a prior study, we estimate - inferred from documentation practices, not direct testing - that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.
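As a quick arithmetic check on the headline figures, the ratios behind "nearly sixfold" and the estimated reproducibility gain can be computed directly from the percentages quoted above. This is only a sanity check on the reported numbers; the helper name `fold_change` is illustrative and does not come from the paper.

```python
def fold_change(start_pct: float, end_pct: float) -> float:
    """Return how many times larger end_pct is than start_pct."""
    if start_pct <= 0:
        raise ValueError("start_pct must be positive")
    return end_pct / start_pct


# Percentages quoted in the abstract.
sharing = fold_change(11, 64)          # code-and-data sharing, 2014 vs 2024
reproducibility = fold_change(28, 64)  # estimated reproducibility, 2014 vs 2024

print(f"code+data sharing: {sharing:.2f}x")
print(f"estimated reproducibility: {reproducibility:.2f}x")
```

The first ratio is about 5.8, consistent with "nearly sixfold"; the second is about 2.3, so the estimated reproducibility somewhat more than doubles.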

What carries the argument

Seven reproducibility variables, quality-assured and applied to every paper in the 56,800-publication dataset from five major AI conferences.

If this is right

Reproducibility in published AI work has more than doubled over the decade, according to the documentation proxy.
The shift toward better documentation began independently of formal checklists at the conferences.
A larger fraction of papers now supply the code and data needed for others to verify results.
Community norms rather than venue mandates appear to be the primary driver of the observed changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Continued improvement in documentation could raise the baseline reliability of new AI claims still further.
Similar long-term tracking in other fields might reveal whether the same open-science trend is occurring elsewhere.
If the proxy relationship holds, venues could use these variables to monitor progress without waiting for full reproduction studies.

Load-bearing premise

The seven selected documentation variables are valid and sufficient proxies for reproducibility, and the empirical mapping taken from the prior study applies uniformly across the entire decade-long dataset.
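To make the premise concrete, one plausible form for such a mapping is a weighted average: each paper falls into a documentation category, each category carries a reproduction success rate taken from the prior study, and the corpus-level estimate is the share-weighted mean of those rates. The sketch below assumes this form, which the abstract does not confirm. The category names and all numbers are hypothetical placeholders, not values from the paper or the prior study.

```python
from typing import Mapping


def estimate_reproducibility(
    shares: Mapping[str, float],
    rates: Mapping[str, float],
) -> float:
    """Share-weighted mean of per-category reproduction rates.

    shares: fraction of papers in each documentation category (sums to 1).
    rates:  assumed probability that a paper in that category reproduces.
    """
    if set(shares) != set(rates):
        raise ValueError("shares and rates must cover the same categories")
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1, got {total}")
    return sum(shares[c] * rates[c] for c in shares)


# HYPOTHETICAL placeholder values, chosen only to exercise the function.
example_rates = {"code_and_data": 0.8, "data_only": 0.4, "neither": 0.2}
example_shares = {"code_and_data": 0.5, "data_only": 0.25, "neither": 0.25}

print(round(estimate_reproducibility(example_shares, example_rates), 3))
```

Under this form the estimate changes over time only because the shares change. The rates are held fixed, which is exactly the uniformity assumption the premise names.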

What would settle it

A direct reproduction attempt on a random sample of papers from both 2014 and 2024 that measures how closely the observed success rates match the documentation-based estimates.
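A sketch of how such an audit could be scored, assuming a simple pass/fail outcome per reproduced paper: compute a Wilson score interval for the observed success rate in each year's sample and check whether the documentation-based estimate falls inside it. The audit counts below are hypothetical; only the 28% and 64% estimates come from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def estimate_is_consistent(estimate: float, successes: int, n: int) -> bool:
    """True if the documentation-based estimate lies inside the audit interval."""
    low, high = wilson_interval(successes, n)
    return low <= estimate <= high


# HYPOTHETICAL audit outcomes: (successes, sample size) per year.
audits = {2014: (30, 100), 2024: (60, 100)}
# Documentation-based estimates reported in the paper.
estimates = {2014: 0.28, 2024: 0.64}

for year, (k, n) in audits.items():
    low, high = wilson_interval(k, n)
    ok = estimate_is_consistent(estimates[year], k, n)
    print(f"{year}: audit interval [{low:.3f}, {high:.3f}], estimate consistent: {ok}")
```

An estimate that falls outside the interval for either year would indicate that the borrowed mapping does not transfer to this corpus, at least at the chosen confidence level.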

Original abstract

The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64%. Building on empirical reproducibility rates from a prior study, we estimate - inferred from documentation practices, not direct testing - that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Large-scale count of rising code and data sharing in AI papers from 2014 to 2024, but the claim that reproducibility more than doubled rests on an unvalidated mapping borrowed from one earlier study.

The paper's main deliverable is a decade-scale count of documentation practices across 56,800 papers from five top AI conferences. Code-plus-data sharing rose from 11% to 64%, and the authors apply rates from a prior study to estimate reproducibility moved from 28% to 64%. That scale and the explicit before-and-after comparison are the new pieces.

The work does a straightforward job of measuring the raw trends in sharing. Looking at every paper rather than a sample, and showing that the rise started before checklists were introduced, gives a useful baseline that smaller audits could not provide. The numbers on documentation practices themselves look like the solid output here.

The softer part is the reproducibility estimate. It is derived by feeding the seven variables into an empirical mapping taken from one earlier paper, with no subsample audit, temporal stability check, or sensitivity run reported in the abstract to confirm the mapping still holds across this corpus or across the ten years. The abstract is clear that this is inference rather than direct measurement, but without those checks the 28-to-64% claim carries more uncertainty than the sharing percentages.

This is the sort of paper that supplies reference numbers for people working on open-science policy or running reproducibility programs in AI. Readers who need a field-wide trend line on documentation will get direct value from the counts; the reproducibility extrapolation is more of a starting point for discussion than a settled result.

The scale and the question are large enough that it deserves referee time. I would send it out rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript examines documentation practices in 56,800 papers from five leading AI conferences (2014–2024). It identifies seven reproducibility variables, reports that papers sharing both code and data rose from 11% to 64%, and estimates—by applying empirical rates from a prior study—that overall reproducibility increased from 28% in 2014 to 64% in 2024. The improvements are argued to predate formal reproducibility checklists and to reflect a broader shift toward open science.

Significance. If the proxy mapping from the seven documentation variables to reproducibility rates is valid and stable over time and across venues, the work supplies the largest-scale temporal assessment to date of open-science trends in AI. The corpus size and the explicit caveat that the 28–64% figures are inferred rather than directly audited are notable strengths.

major comments (2)

[Methods] No description is given of how the seven reproducibility variables were chosen, how quality assurance was performed on the 56,800-paper corpus, or the exact procedure used to convert the observed documentation statistics into the 28%–64% reproducibility estimates. The mapping is taken from a prior study, yet no verification of its applicability to the current dataset or temporal window is supplied.
[Results / Abstract] The headline reproducibility trend (28% to 64%) is load-bearing for the central claim yet rests on an unvalidated proxy relationship. No internal check (such as a subsample reproducibility audit, temporal-stability test, or sensitivity analysis) is reported to confirm that the documentation-to-reproducibility mapping remains constant across the decade or the five conferences.

minor comments (1)

[Abstract] The phrase 'papers sharing both code and data increased nearly sixfold, from 11% to 64%' should clarify whether the percentages refer to the full corpus or to a filtered subset of papers that could plausibly share artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the scale of the corpus and the explicit caveats in our estimates. We address the two major comments point by point below. Where the manuscript lacks sufficient detail, we will revise accordingly.

Referee: [Methods] No description is given of how the seven reproducibility variables were chosen, how quality assurance was performed on the 56,800-paper corpus, or the exact procedure used to convert the observed documentation statistics into the 28%–64% reproducibility estimates. The mapping is taken from a prior study, yet no verification of its applicability to the current dataset or temporal window is supplied.

Authors: We will add a new subsection in the Methods that (a) explains the selection of the seven variables by reference to prior reproducibility taxonomies in the AI literature, (b) details the multi-stage quality-assurance protocol (including inter-annotator agreement metrics and sampling strategy) applied to the full corpus, and (c) states the precise linear mapping from the observed documentation rates to the reproducibility percentages taken from the cited prior study. We will also insert a short paragraph discussing the assumptions underlying the applicability of that mapping to the 2014–2024 window and the five conferences. (Revision: yes)
Referee: [Results / Abstract] The headline reproducibility trend (28% to 64%) is load-bearing for the central claim yet rests on an unvalidated proxy relationship. No internal check (such as a subsample reproducibility audit, temporal-stability test, or sensitivity analysis) is reported to confirm that the documentation-to-reproducibility mapping remains constant across the decade or the five conferences.

Authors: We agree that the proxy relationship is central and will therefore add a sensitivity analysis in the Results section that varies the mapping coefficients within the confidence intervals reported by the prior study and recomputes the 2014–2024 trend. We will also report per-conference and per-year breakdowns to allow readers to assess stability. A full subsample audit that directly tests reproducibility on hundreds of papers is outside the scope of the present observational study; we will strengthen the existing caveats in the abstract and discussion rather than claim such an audit was performed. (Revision: partial)
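The sensitivity analysis proposed in the rebuttal could look roughly like the sketch below: hold the observed documentation shares fixed, move each category's reproduction rate to the lower and upper end of its interval, and report the resulting range for the estimate. This assumes the mapping is a share-weighted average and that the rates can vary independently. All category names, shares, and intervals are hypothetical placeholders rather than values from the paper or the prior study.

```python
from typing import Mapping, Tuple

Interval = Tuple[float, float]


def estimate_range(
    shares: Mapping[str, float],
    rate_intervals: Mapping[str, Interval],
) -> Interval:
    """Lowest and highest share-weighted estimate over the rate intervals.

    Shares are non-negative, so the weighted mean is monotone in every rate
    and the extremes occur with all rates at their lower or upper bounds.
    """
    low = sum(shares[c] * rate_intervals[c][0] for c in shares)
    high = sum(shares[c] * rate_intervals[c][1] for c in shares)
    return low, high


# HYPOTHETICAL (lower, upper) bounds for per-category reproduction rates.
rate_intervals = {
    "code_and_data": (0.7, 0.9),
    "data_only": (0.3, 0.5),
    "neither": (0.1, 0.3),
}
# HYPOTHETICAL documentation shares for an early and a late year.
shares_by_year = {
    "early": {"code_and_data": 0.1, "data_only": 0.4, "neither": 0.5},
    "late": {"code_and_data": 0.6, "data_only": 0.3, "neither": 0.1},
}

for label, shares in shares_by_year.items():
    low, high = estimate_range(shares, rate_intervals)
    print(f"{label}: estimate between {low:.2f} and {high:.2f}")
```

If the ranges for the first and last year do not overlap, the upward trend survives uncertainty in the mapping coefficients. The check says nothing about whether the coefficients themselves drift over time, which is the temporal-stability question the referee raises.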

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper directly counts seven documentation variables across its new 56,800-paper corpus and reports the observed trends (e.g., code+data sharing rising from 11% to 64%). The reproducibility percentages are obtained by applying an external empirical mapping taken from a prior study; this constitutes an inference step rather than any reduction of the measured quantities to themselves by definition, fitting, or self-citation chain. No equations or steps in the provided text exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central reproducibility estimate rests on an external prior study for the documentation-to-reproducibility mapping; the seven variables are treated as given indicators without independent validation shown in the abstract.

free parameters (1)

reproducibility mapping parameters
The conversion of observed documentation rates into the 28%–64% reproducibility figures depends on empirical rates taken from a prior study.

axioms (1)

domain assumption: The seven reproducibility variables are appropriate and sufficient measures of documentation quality
The entire analysis is built on these variables being valid proxies.

Reference graph

Works this paper leans on

67 extracted references · 52 canonical work pages · 1 internal anchor

[1]

PLoS medicine , volume=

Ioannidis, J.P.: Why most published research findings are false. PLoS medicine 2(8), 124 (2005) https://doi.org/10.1371/journal.pmed.0020124

work page doi:10.1371/journal.pmed.0020124 2005
[2]

American Association for the Advancement of Science (2014)

McNutt, M.: Reproducibility. American Association for the Advancement of Science (2014). https://doi.org/10.1126/science.1250475

work page doi:10.1126/science.1250475 2014
[3]

Nature Publishing Group UK London (2016)

Baker, M.: 1,500 scientists lift the lid on reproducibility. Nature Publishing Group UK London (2016). https://doi.org/10.1038/533452a

work page doi:10.1038/533452a 2016
[4]

Pashler, H., Wagenmakers, E.-J.: Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on psycho- logical science7(6), 528–530 (2012) https://doi.org/10.1177/1745691612465253

work page doi:10.1177/1745691612465253 2012
[5]

Science , volume=

Open Science Collaboration: Estimating the reproducibility of psychological science. Science349(6251), 4716 (2015) https://doi.org/10.1126/science.aac4716

work page doi:10.1126/science.aac4716 2015
[6]

Social psychology (2014) https://doi.org/10.1027/ 1864-9335/a000178

Klein, R.A., Ratliff, K.A., Vianello, M., Adams Jr, R.B., Bahn´ ık,ˇS., Bernstein, M.J., Bocian, K., Brandt, M.J., Brooks, B., Brumbaugh, C.C.,et al.: Investigat- ing variation in replicability. Social psychology (2014) https://doi.org/10.1027/ 1864-9335/a000178

2014
[7]

Science351(6280), 1433–1436 (2016) https://doi.org/10.1126/science.aaf09

Camerer, C.F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T.,et al.: Evaluating replicability of laboratory experiments in economics. Science351(6280), 1433–1436 (2016) https://doi.org/10.1126/science.aaf09

work page doi:10.1126/science.aaf09 2016
[8]

Prinz, F., Schlange, T., Asadullah, K.: Believe it or not: how much can we rely on published data on potential drug targets? Nature reviews Drug discovery10(9), 712–712 (2011) https://doi.org/10.1038/nrd3439-c1

work page doi:10.1038/nrd3439-c1 2011
[9]

Nature 505(7485), 612–613 (2014) https://doi.org/10.1038/505612a

Collins, F.S., Tabak, L.A.: Policy: NIH plans to enhance reproducibility. Nature 505(7485), 612–613 (2014) https://doi.org/10.1038/505612a

work page doi:10.1038/505612a 2014
[10]

RaiseStandardsforPreclinicalCancerResearch

Begley, C.G., Ellis, L.M.: Raise standards for preclinical cancer research. Nature 483(7391), 531–533 (2012) https://doi.org/10.1038/483531a

work page doi:10.1038/483531a 2012
[11]

Nature reviews neuroscience14(5), 365–376 (2013) https://doi.org/ 10.1038/nrn3475

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S., Munaf` o, M.R.: Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews neuroscience14(5), 365–376 (2013) https://doi.org/ 10.1038/nrn3475

work page doi:10.1038/nrn3475 2013
[12]

Behavior genetics42(1), 1–2 (2012) https://doi.org/10.1007/s10519-011-9504-z

Hewitt, J.K.: Editorial policy on candidate gene association and candidate gene- by-environment interaction studies of complex traits. Behavior genetics42(1), 1–2 (2012) https://doi.org/10.1007/s10519-011-9504-z

work page doi:10.1007/s10519-011-9504-z 2012
[13]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Gundersen, O.E., Kjensmo, S.: State of the art: Reproducibility in artificial 22 intelligence. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018). https://doi.org/10.1609/aaai.v32i1.11503

work page doi:10.1609/aaai.v32i1.11503 2018
[14]

Science359(6377), 725–726 (2018) https://doi.org/10.1126/science.359.6377.725

Hutson, M.: Artificial intelligence faces reproducibility crisis. Science359(6377), 725–726 (2018) https://doi.org/10.1126/science.359.6377.725

work page doi:10.1126/science.359.6377.725 2018
[15]

Journal of Business Research88, 428–436 (2018) https://doi.org/10.1016/j.jbusres.2017.12.043

Vicente-Saez, R., Martinez-Fuentes, C.: Open science now: A systematic literature review for an integrated definition. Journal of Business Research88, 428–436 (2018) https://doi.org/10.1016/j.jbusres.2017.12.043

work page doi:10.1016/j.jbusres.2017.12.043 2018
[16]

Patterns (2025) https://doi.org/10.1016/j.patter

Bischl, B., Casalicchio, G., Das, T., Feurer, M., Fischer, S., Gijsbers, P., Mukherjee, S., M¨ uller, A.C., N´ emeth, L., Oala, L.,et al.: OpenML: Insights from 10 years and more than a thousand papers. Patterns (2025) https://doi.org/10.1016/j.patter. 2025.101317

work page doi:10.1016/j.patter 2025
[17]

Journal of the Medical Library Association: JMLA105(2), 203 (2017) https://doi.org/10.5195/jmla.2017

Foster, E.D., Deardorff, A.: Open science framework (OSF). Journal of the Medical Library Association: JMLA105(2), 203 (2017) https://doi.org/10.5195/jmla.2017. 88

work page doi:10.5195/jmla.2017 2017
[18]

Scientific data3(1), 1–9 (2016) https://doi.org/10.1038/sdata.2016.18

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., Silva Santos, L.B., Bourne, P.E.,et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific data3(1), 1–9 (2016) https://doi.org/10.1038/sdata.2016.18

work page doi:10.1038/sdata.2016.18 2016
[19]

arXiv preprint arXiv:2403.13784 (2024) https://doi.org/10.48550/arXiv.2403.13784

White, M., Haddad, I., Osborne, C., Liu, X.-Y.Y., Abdelmonsef, A., Varghese, S., Hors, A.L.: The model openness framework: Promoting completeness and openness for reproducibility, transparency, and usability in artificial intelligence. arXiv preprint arXiv:2403.13784 (2024) https://doi.org/10.48550/arXiv.2403.13784

work page doi:10.48550/arxiv.2403.13784 2024
[20]

Scientific Data12(1), 328 (2025) https://doi.org/10.1038/s41597-025-04451-9

Wilkinson, S.R., Aloqalaa, M., Belhajjame, K., Crusoe, M.R., Paula Kinoshita, B., Gadelha, L., Garijo, D., Gustafsson, O.J.R., Juty, N., Kanwal, S.,et al.: Applying the FAIR principles to computational workflows. Scientific Data12(1), 328 (2025) https://doi.org/10.1038/s41597-025-04451-9

work page doi:10.1038/s41597-025-04451-9 2025
[21]

Journal of machine learning research22(164), 1–20 (2021)

Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivi` ere, V., Beygelzimer, A., d’Alch´ e- Buc, F., Fox, E., Larochelle, H.: Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of machine learning research22(164), 1–20 (2021)

2019
[22]

Journal of Artificial Intelligence Research81, 1019–1041 (2024) https://doi.org/10.1613/jair.1.16905

Gundersen, O.E., Helmert, M., Hoos, H.: Improving reproducibility in AI research: Four mechanisms adopted by JAIR. Journal of Artificial Intelligence Research81, 1019–1041 (2024) https://doi.org/10.1613/jair.1.16905

work page doi:10.1613/jair.1.16905 2024
[23]

Advances in neural information processing systems31 (2018) 23

Lucic, M., Kurach, K., Michalski, M., Gelly, S., Bousquet, O.: Are gans created equal? a large-scale study. Advances in neural information processing systems31 (2018) 23

2018
[24]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D.: Deep reinforcement learning that matters. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018). https://doi.org/10.1609/aaai.v32i1.11694

work page doi:10.1609/aaai.v32i1.11694 2018
[25]

In: Proceedings of the 13th ACM Conference on Recommender Systems, pp

Ferrari Dacrema, M., Cremonesi, P., Jannach, D.: Are we really making much progress? a worrying analysis of recent neural recommendation approaches. In: Proceedings of the 13th ACM Conference on Recommender Systems, pp. 101–109 (2019). https://doi.org/10.1145/3298689.3347058

work page doi:10.1145/3298689.3347058 2019
[26]

Computational Linguistics48(4), 1125–1135 (2022) https://doi.org/10.1162/coli a 00448

Belz, A.: A metrological perspective on reproducibility in NLP. Computational Linguistics48(4), 1125–1135 (2022) https://doi.org/10.1162/coli a 00448

work page doi:10.1162/coli 2022
[27]

In: Proceedings of the 2023 ACM Conference on Reproducibility and Replicability, pp

Gundersen, O.E., Shamsaliei, S., Kjærnli, H.S., Langseth, H.: On reporting robust and trustworthy conclusions from model comparison studies involving neural networks and randomness. In: Proceedings of the 2023 ACM Conference on Reproducibility and Replicability, pp. 37–61 (2023). https://doi.org/10.1145/ 3589806.3600044

arXiv 2023
[28]

Communications of the ACM64(12), 86–92 (2021) https://doi.org/10.1145/3458723

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Iii, H.D., Crawford, K.: Datasheets for datasets. Communications of the ACM64(12), 86–92 (2021) https://doi.org/10.1145/3458723

work page doi:10.1145/3458723 2021
[29]

NPJ digital medicine5(1), 48 (2022) https://doi.org/10.1038/s41746-022-00592-y

Varoquaux, G., Cheplygina, V.: Machine learning for medical imaging: method- ological failures and recommendations for the future. NPJ digital medicine5(1), 48 (2022) https://doi.org/10.1038/s41746-022-00592-y

work page doi:10.1038/s41746-022-00592-y 2022
[30]

Patterns4(9) (2023) https://doi.org/10.1016/j.patter.2023

Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine- learning-based science. Patterns4(9) (2023) https://doi.org/10.1016/j.patter.2023. 100804

work page doi:10.1016/j.patter.2023 2023
[31]

38 Mason Christopher E

Haibe-Kains, B., Adam, G.A., Hosny, A., Khodakarami, F., Directors Shraddha Thakkar 35 Kusko Rebecca 36 Sansone Susanna-Assunta 37 Tong Weida 35 Wolfinger Russ D. 38 Mason Christopher E. 39 Jones Wendell 40 Dopazo Joaquin 41 Furlanello Cesare 42, M.A.Q.C.M.S.B., Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A.,et al.: Transparency and repr...

2020
[32]

In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp

Belz, A., Agarwal, S., Shimorina, A., Reiter, E.: A systematic review of reproducibil- ity research in natural language processing. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 381–393 (2021). https://doi.org/10.18653/v1/2021.eacl-main.29

work page doi:10.18653/v1/2021.eacl-main.29 2021
[33]

In: International Conference on Machine Learning, pp

Bouthillier, X., Laurent, C., Vincent, P.: Unreproducible research is reproducible. In: International Conference on Machine Learning, pp. 725–734 (2019). PMLR

2019
[34]

IEEE Transactions on Parallel and Distributed Systems27(12), 3617–3630 (2016) https://doi.org/10.1109/TPDS.2016.2539167

Hunold, S., Carpen-Amarie, A.: Reproducible MPI benchmarking is still not as 24 easy as you think. IEEE Transactions on Parallel and Distributed Systems27(12), 3617–3630 (2016) https://doi.org/10.1109/TPDS.2016.2539167

work page doi:10.1109/tpds.2016.2539167 2016
[35]

Monthly Weather Review141(11), 4165–4172 (2013) https://doi.org/10.1175/MWR-D-12-00352.1

Hong, S.-Y., Koo, M.-S., Jang, J., Esther Kim, J.-E., Park, H., Joh, M.-S., Kang, J.-H., Oh, T.-J.: An evaluation of the software system dependency of a global atmospheric model. Monthly Weather Review141(11), 4165–4172 (2013) https://doi.org/10.1175/MWR-D-12-00352.1

work page doi:10.1175/mwr-d-12-00352.1 2013
[36]

Ebadi, A

Stodden, V., McNutt, M., Bailey, D.H., Deelman, E., Gil, Y., Hanson, B., Heroux, M.A., Ioannidis, J.P., Taufer, M.: Enhancing reproducibility for computational methods. Science354(6317), 1240–1241 (2016) https://doi.org/10.1126/science. aah6168

work page doi:10.1126/science 2016
[37]

In: International Conference on Document Analysis and Recognition, pp

Ajayi, K., Choudhury, M.H., Rajtmajer, S.M., Wu, J.: A study on reproducibil- ity and replicability of table structure recognition methods. In: International Conference on Document Analysis and Recognition, pp. 3–19 (2023). https: //doi.org/10.1007/978-3-031-41679-8 1 . Springer

work page doi:10.1007/978-3-031-41679-8 2023
[38]

arXiv preprint arXiv:2204.07610 (2022) https: //doi.org/10.48550/arXiv.2204.07610

Gundersen, O.E., Coakley, K., Kirkpatrick, C., Gil, Y.: Sources of irreproducibility in machine learning: A review. arXiv preprint arXiv:2204.07610 (2022) https: //doi.org/10.48550/arXiv.2204.07610

work page doi:10.48550/arxiv.2204.07610 2022
[39]

Philosophical Transactions of the Royal Society A379(2197), 20200210 (2021) https://doi.org/ 10.1098/rsta.2020.0210

Gundersen, O.E.: The fundamental principles of reproducibility. Philosophical Transactions of the Royal Society A379(2197), 20200210 (2021) https://doi.org/ 10.1098/rsta.2020.0210

work page doi:10.1098/rsta.2020.0210 2021
[40]

Review of general psychology13(2), 90–100 (2009) https://doi.org/10.1037/a0015108

Schmidt, S.: Shall we really do it again? the powerful concept of replication is neglected in the social sciences. Review of general psychology13(2), 90–100 (2009) https://doi.org/10.1037/a0015108

work page doi:10.1037/a0015108 2009
[41]

Social Psychology45(3), 137–141 (2014) https://doi.org/10.1027/1864-9335/ a000192

Nosek, B.A., Lakens, D.: A method to increase the credibility of published results. Social Psychology45(3), 137–141 (2014) https://doi.org/10.1027/1864-9335/ a000192

work page doi:10.1027/1864-9335/ 2014
[42]

Goodman, S.N., Fanelli, D., Ioannidis, J.P.: What does research reproducibility mean? Science translational medicine8(341), 341–1234112 (2016) https://doi.org/ 10.1126/scitranslmed.aaf5027

work page doi:10.1126/scitranslmed.aaf5027 2016
[43]

Communications of the ACM59(3), 62–69 (2016) https://doi.org/10.1145/ 2812803

Collberg, C., Proebsting, T.A.: Repeatability in computer systems research. Communications of the ACM59(3), 62–69 (2016) https://doi.org/10.1145/ 2812803

2016
[44]

In: Proceedings of the 33rd International Conference on Neural Informa- tion Processing Systems, vol

Raff, E.: A step toward quantifying independently reproducible machine learning research. In: Proceedings of the 33rd International Conference on Neural Informa- tion Processing Systems, vol. 32. Curran Associates Inc., Red Hook, NY, USA (2019) 25

2019
[45]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Gundersen, O.E., Cappelen, O., Møln˚ a, M., Nilsen, N.G.: The unreasonable effectiveness of open science in AI: A replication study. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 26211–26219 (2025). https://doi.org/10.1609/aaai.v39i25.34818

work page doi:10.1609/aaai.v39i25.34818 2025
[46]

Automatic evaluate dialogue ap- propriateness by using dialogue act

Magnusson, I., Smith, N.A., Dodge, J.: Reproducibility in NLP: What have we learned from the checklist? In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 12789–12811 (2023). https://doi.org/10.18653/v1/2023. findings-acl.809

work page doi:10.18653/v1/2023 2023
[47]

AI magazine 39(3), 56–68 (2018) https://doi.org/10.1609/aimag.v39i3.2816

Gundersen, O.E., Gil, Y., Aha, D.W.: On reproducible AI: Towards reproducible research, open science, and digital scholarship in AI publications. AI magazine 39(3), 56–68 (2018) https://doi.org/10.1609/aimag.v39i3.2816

work page doi:10.1609/aimag.v39i3.2816 2018
[48]

PloS one13(3), 0194889 (2018) https://doi.org/10.1371/journal.pone.0194889

Makridakis, S., Spiliotis, E., Assimakopoulos, V.: Statistical and machine learning forecasting methods: Concerns and ways forward. PloS one13(3), 0194889 (2018) https://doi.org/10.1371/journal.pone.0194889

work page doi:10.1371/journal.pone.0194889 2018
[49]

In: Parallel Computing: Technology Trends, pp

Pouchard, L., Lin, Y., Van Dam, H.: Replicating machine learning experiments in materials science. In: Parallel Computing: Technology Trends, pp. 743–755. IOS Press, Amsterdam (2020). https://doi.org/10.3233/APC200105

work page doi:10.3233/apc200105 2020
[50]

In: Proceedings of the IEEE 18th International Conference on e-Science (e-Science), pp

Coakley, K., Kirkpatrick, C.R., Gundersen, O.E.: Examining the effect of imple- mentation factors on deep learning reproducibility. In: Proceedings of the IEEE 18th International Conference on e-Science (e-Science), pp. 397–398 (2022). https://doi.org/10.1109/eScience55777.2022.00056 . IEEE

work page doi:10.1109/escience55777.2022.00056 2022
[51]

In: Marculescu, D., Chi, Y., Wu, C

Zhuang, D., Zhang, X., Song, S., Hooker, S.: Randomness in neural network training: Characterizing the impact of tooling. In: Marculescu, D., Chi, Y., Wu, C. (eds.) Proceedings of the Fourth Conference on Machine Learning and Systems, vol. 4, pp. 316–336 (2022)

2022
[52]

Advances in Neural Information Processing Systems34, 3081–3095 (2021)

Cooper, A.F., Lu, Y., Forde, J., De Sa, C.M.: Hyperparameter optimization is deceiving us, and how to stop it. Advances in Neural Information Processing Systems34, 3081–3095 (2021)

2021
[53]

In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp

Reimers, N., Gurevych, I.: Reporting score distributions makes a difference: Perfor- mance study of LSTM-networks for sequence tagging. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 338–348 (2017). https://doi.org/10.18653/v1/D17-1035

work page doi:10.18653/v1/d17-1035 2017
[54]

Metropolitan Books, New York City, New York (2010)

Gawande, A.: The Checklist Manifesto: How to Get Things Right. Metropolitan Books, New York City, New York (2010)

2010
[55]

Ai Magazine40(4), 9–23 (2019) https://doi.org/10.1609/aimag.v40i4.5185 26

Gundersen, O.E.: Standing on the feet of giants—reproducibility in AI. Ai Magazine40(4), 9–23 (2019) https://doi.org/10.1609/aimag.v40i4.5185 26

work page doi:10.1609/aimag.v40i4.5185 2019
[56]

Earth and Space Science3(10), 388–415 (2016) https://doi.org/10.1002/2015EA000136

Gil, Y., David, C.H., Demir, I., Essawy, B.T., Fulweiler, R.W., Goodall, J.L., Karlstrom, L., Lee, H., Mills, H.J., Oh, J.-H.,et al.: Toward the geoscience paper of the future: Best practices for documenting and sharing research from data to software to provenance. Earth and Space Science3(10), 388–415 (2016) https://doi.org/10.1002/2015EA000136

work page doi:10.1002/2015ea000136 2016
[57]

Gil, Y.: Will AI write scientific papers in the future? AI Magazine42(4), 3–15 (2022) https://doi.org/10.1609/aaai.12027

work page doi:10.1609/aaai.12027 2022
[58]

In: Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, pp

Bhaskar, A., Stodden, V.: Reproscreener: Leveraging LLMs for assessing com- putational reproducibility of machine learning pipelines. In: Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, pp. 101–109 (2024). https://doi.org/10.1145/3641525.3663629

[59] Bibal, A., Minton, S.N., Khider, D., Gil, Y.: AI copilots for reproducibility in science: A case study. arXiv preprint arXiv:2506.20130 (2025) https://doi.org/10.48550/arXiv.2506.20130

[60] Gottweis, J., Weng, W.-H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., et al.: Towards an AI co-scientist. arXiv preprint arXiv:2502.18864 (2025) https://doi.org/10.48550/arXiv.2502.18864

[61] Cheetham, A.K., Seshadri, R.: Artificial intelligence driving materials discovery? perspective on the article: Scaling deep learning for materials discovery. Chemistry of Materials 36(8), 3490–3495 (2024) https://doi.org/10.1021/acs.chemmater.4c00643

[62] Guan, Y., Cui, L., Inchai, J., Fang, Z., Law, J., Brito, A.A.G., Pawlosky, A., Gottweis, J., Daryin, A., Myaskovsky, A., et al.: AI-assisted drug re-purposing for human liver fibrosis. Advanced Science 12(44), 08751 (2025) https://doi.org/10.1002/advs.202508751

[63] He, L., Patkowski, J.B., Wang, J., Miguel-Romero, L., Aylett, C.H.S., Fillol-Salom, A., Costa, T.R.D., Penadés, J.R.: Chimeric infective particles expand species boundaries in phage-inducible chromosomal island mobilization. Cell 188(23), 6636–665317 (2025) https://doi.org/10.1016/j.cell.2025.08.019

[64] Peters, U., Chin-Yee, B.: Generalization bias in large language model summarization of scientific research. Royal Society Open Science 12(4), 241776 (2025) https://doi.org/10.1098/rsos.241776

[65] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)

[66] Coakley, K.L., Snelleman, T., Hoos, H., Gundersen, O.E.: GitHub: Kevincoakley/ai-research-moves-towards. https://doi.org/10.5281/zenodo.20785801

S4 Supplementary Tables

Reproducibility Variable   AAAI   ICML   ICLR   IJCAI   NeurIPS
                           2021   2023   2022   2021    2019
Pseudocode                 ✓      –      –      ✓       –
Open Code                  ✓      ✓      ✓      ✓       ✓
Open Datasets              ✓      ✓      ✓      ✓       ✓
Dataset Splits             –      ✓      –      ✓       ✓
Hardware Specification     ✓      ...

