A Dataset of Reproducible Flaky-Test Failures
Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3
The pith
The paper presents ReproFlake, a dataset providing reproducible environments and scripts for 1115 flaky tests to study and repair their nondeterministic failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures, the characteristics such as location of the fix and its correlation with the flaky test category, as well as difficulties researchers may face in using our dataset to collect additional information (e.g., code coverage) about flaky tests.
What carries the argument
ReproFlake, a packaged collection of flaky tests with build environments, reproduction scripts, automated fix scripts, and pass/fail logs that allows consistent reproduction of nondeterministic behavior.
If this is right
- Error information helps identify flaky test categories and guide repairs.
- Unresolved compilation failures highlight challenges in building legacy projects.
- Knowing typical fix locations can help prioritize repair efforts.
- Guidelines support community contributions to expand the dataset with more reproducible tests.
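As a rough illustration of the fix-location point, the sketch below tabulates where fixes land for each flaky-test category and picks the most common location. The record layout, the category labels, and the location labels are all hypothetical placeholders; the source names neither the four categories nor the fix locations, and the numbers are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical records: (flaky-test category, fix location). These labels and
# counts are placeholders, not data from the paper.
records = [
    ("category-A", "test code"),
    ("category-A", "test code"),
    ("category-A", "production code"),
    ("category-B", "production code"),
    ("category-B", "production code"),
    ("category-B", "test code"),
    ("category-B", "production code"),
]


def fix_location_shares(rows):
    """Return {category: {location: share}}; shares sum to 1 within a category."""
    counts = defaultdict(Counter)
    for category, location in rows:
        counts[category][location] += 1
    return {
        category: {loc: n / sum(locs.values()) for loc, n in locs.items()}
        for category, locs in counts.items()
    }


def most_likely_location(rows, category):
    """Return the fix location seen most often for the given category."""
    shares = fix_location_shares(rows)[category]
    return max(shares, key=shares.get)


if __name__ == "__main__":
    for category in ("category-A", "category-B"):
        print(category, "->", most_likely_location(records, category))
```

A repair effort could inspect the most frequent location first, falling back to the others only if no fix is found there.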
Where Pith is reading between the lines
- Integrating this dataset into continuous integration pipelines could help teams detect and address flakiness earlier in development.
- Extending the reproduction scripts to measure code coverage might reveal patterns in which parts of the code contribute to flakiness.
- The approach of bundling reproduction and repair tools could apply to other nondeterministic issues in software, such as race conditions.
- Researchers could use the dataset to benchmark new flaky test detection algorithms against a standardized, reproducible set of examples.
Load-bearing premise
The tests selected from developer reports and previous datasets, together with the accompanying scripts, faithfully reproduce real flaky failures without introducing artificial behaviors or selection biases.
What would settle it
Executing the reproduction scripts on an independent system and verifying that the tests pass and fail intermittently as documented in the logs, while the fix scripts eliminate the flakiness.
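One way to picture this check is a small rerun harness: run a test command many times, record the exit codes, and compare the outcome before and after the fix. This is a minimal sketch under stated assumptions, not the dataset's own tooling; the function names and the default of 20 runs are hypothetical, and the command to execute would be whatever reproduction script the dataset ships.

```python
import subprocess


def rerun(command, runs=20):
    """Run `command` (a list of arguments) `runs` times and collect exit codes."""
    return [
        subprocess.run(command, capture_output=True).returncode
        for _ in range(runs)
    ]


def classify(exit_codes):
    """Label a series of runs; exit code 0 is treated as a pass."""
    passed = sum(1 for code in exit_codes if code == 0)
    failed = len(exit_codes) - passed
    if passed and failed:
        return "flaky"
    return "always-pass" if passed else "always-fail"


def settles_claim(before_fix, after_fix):
    """True if runs were intermittent before the fix and all passed after it."""
    return classify(before_fix) == "flaky" and classify(after_fix) == "always-pass"


if __name__ == "__main__":
    # Illustrative exit codes only; real values would come from rerun(...).
    print(settles_claim([0, 1, 0, 0, 1], [0, 0, 0, 0, 0]))
```

Note that a finite number of passing runs after the fix cannot prove the absence of flakiness; it only fails to observe it, so the run count bounds how much confidence the check provides.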
Original abstract
Flaky tests pass and fail non-deterministically when run on the same version of code. Although many techniques have been proposed to detect, debug, and repair flaky tests, reproducing their failures remains a major challenge due to their inherent nondeterminism. Many flaky test datasets exist to help researchers study them, but these datasets are often composed of disjoint sets of flaky tests, where each dataset provides unique information, such as flaky tests of different categories, failure logs of flaky tests, or flaky tests reported by developers vs. flaky tests found by automated tools. In this work, we aim to create a reproducible dataset of flaky tests, curated from both developer issue reports and a popular dataset of flaky tests. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures (e.g., challenges researchers may face when using prior flaky test datasets), the characteristics (e.g., location of the fix and its correlation with the flaky test category), and difficulties researchers may face in using our dataset to collect additional information (e.g., code coverage) about flaky tests. Our findings show that error information helps identify flaky test categories and guide repairs, that unresolved compilation failures highlight challenges in building legacy projects, and knowing typical fix locations can help prioritize repair efforts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ReproFlake, a dataset of 1115 reproducible flaky tests across four categories, curated from developer issue reports and an existing flaky-test collection. It supplies Docker-style compilation environments, scripts to reproduce failures, scripts to apply fixes and verify non-flakiness, and passing/failing execution logs, together with contribution guidelines and an analysis of reproduction challenges, fix locations, and category identification via error messages.
Significance. If the reproducibility claims are substantiated, the dataset would provide a materially more usable resource than prior disjoint collections by enabling direct execution, fix verification, and controlled experiments on flaky-test repair. The explicit inclusion of both passing and failing logs plus automated fix scripts is a concrete strength that could support reproducible tool evaluations.
major comments (2)
- [Abstract] Abstract and dataset-construction section: the central claim that all 1115 tests are reproducible under the supplied environments and scripts is load-bearing, yet the manuscript does not report a quantified reproduction success rate (e.g., fraction that compile and exhibit nondeterminism on first run of the provided scripts). The abstract's reference to unresolved compilation failures for legacy projects indicates that a non-negligible subset may require manual patches or fail to reproduce the reported flaky behavior, which would undermine the advertised lack of selection bias and full reproducibility.
- [Evaluation] Evaluation / usage section: the analysis of challenges in reproducing failures and collecting additional data (e.g., coverage) is presented as a demonstration of the dataset's utility, but without an explicit accounting of how many of the 1115 entries actually succeeded in the authors' own reproduction runs, it is difficult to assess whether the reported characteristics (fix locations, category correlations) are representative of the full curated set or only of the successfully reproduced subset.
minor comments (2)
- [Abstract] The four flaky-test categories are referenced repeatedly but never defined or linked to standard taxonomies in the abstract or early sections; a short table or citation would improve readability.
- [Tables/Figures] Figure captions and table headers could more explicitly state the exact number of tests per category and the success/failure counts for the reproduction scripts.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important opportunities to strengthen the substantiation of our reproducibility claims. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Abstract] Abstract and dataset-construction section: the central claim that all 1115 tests are reproducible under the supplied environments and scripts is load-bearing, yet the manuscript does not report a quantified reproduction success rate (e.g., fraction that compile and exhibit nondeterminism on first run of the provided scripts). The abstract's reference to unresolved compilation failures for legacy projects indicates that a non-negligible subset may require manual patches or fail to reproduce the reported flaky behavior, which would undermine the advertised lack of selection bias and full reproducibility.
Authors: We agree that an explicit quantified reproduction success rate would make the central claim more transparent and easier to evaluate. The 1115 tests were included only after we successfully established Docker-style environments and observed the reported flaky behavior using the supplied scripts. The reference to unresolved compilation failures pertains to a separate analysis of legacy-project challenges encountered outside the core curated set. To address the referee's concern directly, we will revise the abstract for precision and add a dedicated paragraph in the dataset-construction section that reports the success metrics from our own reproduction runs, including the fraction of tests that compiled cleanly and exhibited nondeterminism on the first execution of the provided scripts. revision: yes
- Referee: [Evaluation] Evaluation / usage section: the analysis of challenges in reproducing failures and collecting additional data (e.g., coverage) is presented as a demonstration of the dataset's utility, but without an explicit accounting of how many of the 1115 entries actually succeeded in the authors' own reproduction runs, it is difficult to assess whether the reported characteristics (fix locations, category correlations) are representative of the full curated set or only of the successfully reproduced subset.
Authors: The analyses of fix locations, category correlations, and reproduction challenges were performed on the full set of 1115 entries, each of which was verified to reproduce under the supplied environments before inclusion. Nevertheless, we acknowledge that stating the exact number of successful author reproductions would help readers judge representativeness. We will add an explicit accounting in the evaluation section, including any cases that required minor manual adjustments for legacy dependencies, so that the scope of the reported characteristics is unambiguous. revision: yes
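The accounting discussed in this exchange could be computed along the following lines. This is only a sketch of one possible bookkeeping scheme: the `ReproRecord` fields, the definition of "reproduced" (at least one pass and one fail observed), and the sample values are assumptions for illustration, not figures or formats from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReproRecord:
    test_id: str
    compiled: bool      # did the supplied environment build the project?
    failures: int       # failing runs observed with the reproduction script
    passes: int         # passing runs observed with the reproduction script
    manual_patch: bool  # was a manual adjustment needed (e.g., legacy dependency)?


def accounting(records):
    """Summarize how many entries compiled, reproduced, and did so unaided."""
    total = len(records)
    compiled = [r for r in records if r.compiled]
    # Assumption: "reproduced" means both outcomes were observed.
    reproduced = [r for r in compiled if r.failures > 0 and r.passes > 0]
    unaided = [r for r in reproduced if not r.manual_patch]
    return {
        "total": total,
        "compiled": len(compiled),
        "reproduced": len(reproduced),
        "reproduced_unaided": len(unaided),
        "reproduction_rate": len(reproduced) / total if total else 0.0,
    }


# Made-up sample records, used only to exercise the function.
sample = [
    ReproRecord("t1", True, 3, 7, False),
    ReproRecord("t2", True, 0, 10, False),
    ReproRecord("t3", False, 0, 0, False),
    ReproRecord("t4", True, 2, 8, True),
]

if __name__ == "__main__":
    print(accounting(sample))
```

Reporting the unaided count separately from the total makes it visible whether the analyzed characteristics describe the full curated set or only the entries that reproduced without manual intervention.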
Circularity Check
No circularity: dataset construction paper with explicit curation steps
This is a dataset paper that curates 1115 flaky tests from developer reports and prior datasets, then supplies Docker-style environments, reproduction scripts, fix scripts, and logs. No equations, predictions, fitted parameters, or first-principles derivations appear anywhere in the abstract or described content. All claims are grounded in the concrete construction process (selection criteria, script execution, and verification that tests are no longer flaky after fixes). The work is self-contained against external benchmarks because reproducibility can be checked by running the provided artifacts on the released dataset; no load-bearing step reduces to a self-citation or to the target result by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Flaky tests fall into four well-defined categories that can be reliably identified from error information.
Suzzana Rafi, Mahbub-Ul-Hoque Sumon, Md Erfan, Maruf Morshed Khan, August Shi, and Wing Lam. A Dataset of Reproducible Flaky-Test Failures. Vol. 1, No. 1. Publication date: May 2026.