ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
Pith reviewed 2026-06-27 00:58 UTC · model grok-4.3
The pith
Without running any code, the best LLM agent configuration surfaces at least one finding semantically related to a human-reported reproducibility blocker for roughly 90 percent of 1,149 machine learning papers, working from paper and repository text alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReproRepo treats human-raised GitHub issues on paper repositories as naturally occurring supervision that marks genuine reproduction blockers. On a corpus of 1,149 recent machine learning papers, the best of four model-agent configurations, Codex with GPT-5.5, receives only the paper text and repository contents (no code execution) and still identifies at least one semantically related human-reported blocker for approximately 90 percent of the papers. The agents are especially reliable at surfacing visible failures and identifying the correct semantic region, yet remain limited in exact localization.
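The headline number is a paper-level "at least one match" rate, which is worth seeing in code because it is a lenient metric. The sketch below is illustrative only: the function name, the data layout, and the toy values are assumptions, not taken from the paper or its repository.

```python
from typing import Dict, List


def paper_level_hit_rate(matches: Dict[str, List[bool]]) -> float:
    """Fraction of papers with at least one human-reported issue matched.

    `matches` maps a hypothetical paper ID to one boolean per human-raised
    issue: True if some agent finding was judged semantically related to it.
    """
    if not matches:
        raise ValueError("need at least one paper")
    hits = sum(1 for flags in matches.values() if any(flags))
    return hits / len(matches)


# Toy data for illustration; these are not results from the paper.
toy = {
    "paper_a": [True, False, False],  # one of three issues surfaced
    "paper_b": [False, False],        # nothing surfaced
    "paper_c": [True, True],          # both issues surfaced
    "paper_d": [False, True],         # one of two issues surfaced
}
rate = paper_level_hit_rate(toy)
print(f"{rate:.2f}")  # 0.75
```

Note that paper_a counts as a full hit even though two of its three issues were missed, so a high paper-level rate does not by itself say how many individual blockers were found.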
What carries the argument
ReproRepo, the framework that converts human-raised GitHub issues into scalable, naturally occurring labels for evaluating LLM agents on paper-repository pairs.
If this is right
- Reproducibility checks can be run at the scale of thousands of papers using only existing issue data.
- LLM agents supply a practical first filter that catches most visible blockers before any code is run.
- Evaluation effort can shift from labeling new examples to refining how agents localize issues more precisely.
- ReproRepo itself becomes a reusable testbed for comparing future agent versions on the same real-world task.
Where Pith is reading between the lines
- Combining the current text-only agents with lightweight code-execution steps could close the remaining gap in exact localization.
- The same GitHub-issue approach might transfer to other fields that maintain public code repositories with issue trackers.
- Patterns in the issues that agents consistently miss could guide targeted improvements in agent prompting or retrieval.
- Over time the growing set of agent outputs could itself become a dataset for training more specialized reproduction checkers.
Load-bearing premise
Two assumptions carry the result: that human-raised GitHub issues accurately represent the true reproducibility blockers, and that semantic relatedness between an agent's output and those issues is a sufficient signal that the agent has found the problem.
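To make the second assumption concrete, here is a minimal sketch of one way semantic relatedness could be operationalized: a similarity threshold. The paper's actual procedure is not described in the abstract, so everything here is an assumption. A bag-of-words cosine stands in for a real embedding model so the example runs with the standard library only, and the 0.3 threshold is an arbitrary placeholder.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_related(finding: str, issue: str, threshold: float = 0.3) -> bool:
    """Hypothetical matching rule; the threshold is an arbitrary placeholder."""
    return cosine(bow(finding), bow(issue)) >= threshold


# Made-up strings for illustration only.
finding = "The training script references a missing config file for the dataset path"
issue_hit = "Missing config file: training script fails because dataset path is not set"
issue_miss = "Could you add support for multi-GPU inference?"
print(is_related(finding, issue_hit), is_related(finding, issue_miss))  # True False
```

Any rule of this shape matches on topical overlap, which is exactly why the premise matters: a finding can share vocabulary with an issue without diagnosing the failure the issue reports.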
What would settle it
An independent expert review on a fresh sample of papers showing that the issues agents flag are not the actual blockers that prevent reproduction, or a new run on held-out papers where the semantic-match rate falls well below 90 percent.
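Whether a held-out rate falls "well below 90 percent" depends on sample size, so a confidence interval is a natural companion to that test. The sketch below uses the standard Wilson score interval; the held-out counts are hypothetical and chosen only to show the check.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical held-out sample: 200 papers, 150 with at least one match.
low, high = wilson_interval(150, 200)
print(f"{low:.3f} to {high:.3f}")

# The claim is in trouble only if the whole interval sits below 0.90.
falls_below = high < 0.90
print(falls_below)
```

Comparing the interval's upper bound with 0.90, rather than the point estimate, keeps a small unlucky sample from being read as a refutation.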
Original abstract
Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReproRepo, a scalable framework that treats human-raised GitHub issues on paper repositories as naturally occurring labels for reproducibility blockers. It evaluates four LLM agent configurations on 1,149 recent ML papers and reports that the strongest configuration (Codex with GPT-5.5) surfaces at least one semantically related human-reported blocker for ~90% of papers, even without code execution. The work positions this approach as a reusable alternative to manually curated reproducibility benchmarks and releases the associated code.
Significance. If the assumptions about issue validity and semantic relatedness hold, the framework provides a low-cost, scalable method for auditing LLM agents on realistic reproducibility tasks, which could accelerate evaluation beyond the small-scale manual benchmarks common in the field. The public code release is a concrete strength that supports future reuse and extension.
Major comments (3)
- [Abstract] Abstract and results paragraph: The central quantitative claim (~90% of papers have at least one semantically related blocker surfaced) rests on an unvalidated proxy; the manuscript provides no description of how semantic relatedness is operationalized (e.g., embedding similarity threshold, LLM judge prompt, or human annotation protocol) nor any inter-annotator agreement or manual validation that the matched issues actually describe reproducibility failures rather than installation queries or feature requests.
- [Abstract] Dataset construction (implied in abstract and methods): The 1,149-paper corpus is restricted to repositories that already contain GitHub issues; no statistics or filtering criteria are reported to confirm that the retained issues predominantly concern reproducibility blockers, which directly affects whether the 90% figure can be interpreted as evidence that agents identify real-world reproducibility problems.
- [Results paragraph] Evaluation design: The claim that agents are 'particularly effective for surfacing visible failures' but 'insufficient in exact localization' is presented without accompanying quantitative breakdowns, example agent outputs, or error analysis that would allow readers to assess the distinction between semantic-region identification and actionable diagnosis.
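The inter-annotator agreement requested in the first comment is commonly reported as Cohen's kappa. The following is a minimal sketch of that statistic for two annotators; the label names and the toy annotations are hypothetical and are not drawn from the manuscript.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations of six issues; the categories are illustrative.
ann_1 = ["blocker", "blocker", "install", "feature", "blocker", "install"]
ann_2 = ["blocker", "install", "install", "feature", "blocker", "blocker"]
kappa = cohens_kappa(ann_1, ann_2)
print(f"{kappa:.3f}")  # 0.455
```

Raw agreement here is 4 of 6, but kappa is lower because two annotators with these label frequencies would agree on 14 of 36 item pairs by chance.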
Minor comments (1)
- [Abstract] The model name 'Codex with GPT-5.5' is non-standard and should be clarified with exact API identifiers or version numbers used.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments. We address each major point below and commit to revisions that strengthen the clarity and rigor of the manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract and results paragraph: The central quantitative claim (~90% of papers have at least one semantically related blocker surfaced) rests on an unvalidated proxy; the manuscript provides no description of how semantic relatedness is operationalized (e.g., embedding similarity threshold, LLM judge prompt, or human annotation protocol) nor any inter-annotator agreement or manual validation that the matched issues actually describe reproducibility failures rather than installation queries or feature requests.
Authors: We agree that additional detail is required. The current manuscript describes the matching procedure at a high level but does not provide the precise operationalization or validation statistics. In the revision we will expand the Methods section with the exact procedure (embedding model and threshold or LLM judge prompt), report inter-annotator agreement from a human validation study on a sampled subset, and clarify the criteria used to confirm that matched issues describe reproducibility blockers. revision: yes
-
Referee: [Abstract] Dataset construction (implied in abstract and methods): The 1,149-paper corpus is restricted to repositories that already contain GitHub issues; no statistics or filtering criteria are reported to confirm that the retained issues predominantly concern reproducibility blockers, which directly affects whether the 90% figure can be interpreted as evidence that agents identify real-world reproducibility problems.
Authors: We will add an explicit subsection on dataset construction that reports the repository and issue selection criteria together with summary statistics (e.g., proportion of issues manually categorized as reproducibility-related versus installation queries or feature requests) on a representative sample. This will allow readers to assess the composition of the supervision signal. revision: yes
-
Referee: [Results paragraph] Evaluation design: The claim that agents are 'particularly effective for surfacing visible failures' but 'insufficient in exact localization' is presented without accompanying quantitative breakdowns, example agent outputs, or error analysis that would allow readers to assess the distinction between semantic-region identification and actionable diagnosis.
Authors: We accept that the current presentation lacks supporting detail. The revision will include (i) quantitative breakdowns of success rates stratified by failure visibility and localization granularity, (ii) representative agent output examples, and (iii) a dedicated error-analysis subsection that distinguishes semantic-region matches from precise localization failures. revision: yes
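The stratified breakdown promised in the last response amounts to a grouped success rate. A minimal sketch follows, assuming each evaluated issue carries a stratum label and a match flag; the stratum names ("visible", "silent") and the records are hypothetical, and the real categories would come from the revised manuscript.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def stratified_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Match rate per stratum from (stratum, matched) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for stratum, matched in records:
        totals[stratum] += 1
        hits[stratum] += int(matched)
    return {s: hits[s] / totals[s] for s in totals}


# Toy records for illustration; these are not results from the paper.
records = [
    ("visible", True), ("visible", True), ("visible", False),
    ("silent", True), ("silent", False), ("silent", False), ("silent", False),
]
rates = stratified_rates(records)
for stratum, value in rates.items():
    print(f"{stratum}: {value:.2f}")
```

The same function works for localization granularity by swapping the stratum label, which would let readers compare semantic-region matches with exact localization side by side.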
Circularity Check
No significant circularity; evaluation uses external human-generated labels as independent ground truth
Full rationale
The paper's central evaluation compares LLM agent outputs against pre-existing human-raised GitHub issues on paper repositories, treating those issues as naturally occurring external supervision. No derivation step reduces a claimed result to a fitted parameter, a self-citation chain, or an input by construction; the reported ~90% figure is a direct empirical match rate against an independent dataset. The evaluation rests on these external labels, with no load-bearing self-citations or ansatzes that would collapse the claim into its own inputs.