State of the art: Reproducibility in artificial intelligence

Odd Erik Gundersen, Sigbjørn Kjensmo · 2018 · DOI 10.1609/aaai.v32i1.11503

3 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

cs.DL · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming prior approaches on ReproBench and GoldStandardDB.

Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

cs.MA · 2026-05-21 · unverdicted · novelty 5.0

Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.

citing papers explorer

Showing 3 of 3 citing papers.

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review cs.DL · 2026-05-04 · unverdicted · none · ref 57 · 2 links
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 13
Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 12
