A Dataset of Reproducible Flaky-Test Failures

August Shi; Mahbub-Ul-Hoque Sumon; Maruf Morshed Khan; Md Erfan; Suzzana Rafi; Wing Lam

arxiv: 2605.21677 · v1 · pith:SW5TMYPCnew · submitted 2026-05-20 · 💻 cs.SE

Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3

classification 💻 cs.SE

keywords: flaky tests, reproducible dataset, software testing, test flakiness, flaky test fixes, execution logs, nondeterministic tests

The pith

The paper presents ReproFlake, a dataset providing reproducible environments and scripts for 1115 flaky tests to study and repair their nondeterministic failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flaky tests fail unpredictably even on unchanged code, making them difficult to debug and fix. The authors create ReproFlake to overcome limitations in existing datasets by including everything needed to reliably reproduce the failures. This includes compilation environments, failure reproduction scripts, fix application scripts that confirm the tests become stable, and detailed execution logs. A sympathetic reader would see this as enabling systematic study of flaky test categories, fix locations, and practical challenges like legacy code builds. The dataset also comes with contribution guidelines to grow the collection over time.

Core claim

We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures, the characteristics such as location of the fix and its correlation with the flaky test category, as

What carries the argument

ReproFlake, a packaged collection of flaky tests with build environments, reproduction scripts, automated fix scripts, and pass/fail logs that allows consistent reproduction of nondeterministic behavior.
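
To make the four bundled artifact types concrete, the following Python sketch models one dataset entry as a record and checks that nothing is missing. It is only an illustration: the class name `FlakyTestEntry`, its field names, and any example values are assumptions made here, not the layout ReproFlake actually uses. The category abbreviations are the ones that appear in the figure captions.

```python
from dataclasses import dataclass
from typing import List

# Category abbreviations as they appear in the figure captions.
CATEGORIES = {"ID", "OD", "TD", "NIO"}

# Artifact fields that every entry is expected to carry (hypothetical names).
REQUIRED_ARTIFACTS = [
    "build_environment",
    "reproduce_script",
    "fix_script",
    "passing_log",
    "failing_log",
]


@dataclass
class FlakyTestEntry:
    """Hypothetical record for one flaky test; not the dataset's real schema."""
    test_name: str
    category: str
    build_environment: str  # e.g. a container image tag
    reproduce_script: str   # path to the script that triggers the failure
    fix_script: str         # path to the script that applies the fix
    passing_log: str        # path to a log of a passing run
    failing_log: str        # path to a log of a failing run

    def missing_artifacts(self) -> List[str]:
        """Return the names of required artifact fields that are empty."""
        return [name for name in REQUIRED_ARTIFACTS if not getattr(self, name)]

    def is_complete(self) -> bool:
        """True when the category is known and every artifact is present."""
        return self.category in CATEGORIES and not self.missing_artifacts()
```

A check like `is_complete()` is the kind of gate a contribution guideline could enforce before a new entry is accepted, though the source does not say how the authors validate submissions.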

If this is right

Error information helps identify flaky test categories and guide repairs.
Unresolved compilation failures highlight challenges in building legacy projects.
Knowing typical fix locations can help prioritize repair efforts.
Guidelines support community contributions to expand the dataset with more reproducible tests.
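
The point about typical fix locations can be explored with a simple cross-tabulation of category against fix location. The sketch below is a minimal illustration in Python; the location labels and the helper names are assumptions made here, and any records passed in would have to come from the dataset itself.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def fix_location_by_category(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Counter]:
    """Count fix locations per flaky-test category.

    Each record is a (category, fix_location) pair, for example
    ("OD", "test code"). The location labels are hypothetical.
    """
    table: Dict[str, Counter] = defaultdict(Counter)
    for category, location in records:
        table[category][location] += 1
    return dict(table)


def most_common_location(
    table: Dict[str, Counter], category: str
) -> Optional[str]:
    """Return the most frequent fix location for a category, if any."""
    counts = table.get(category)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Ranking locations per category in this way is one plausible reading of "prioritize repair efforts": start looking where fixes for that category most often landed.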

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integrating this dataset into continuous integration pipelines could help teams detect and address flakiness earlier in development.
Extending the reproduction scripts to measure code coverage might reveal patterns in which parts of the code contribute to flakiness.
The approach of bundling reproduction and repair tools could apply to other nondeterministic issues in software, such as race conditions.
Researchers could use the dataset to benchmark new flaky test detection algorithms against a standardized, reproducible set of examples.
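
As a rough illustration of the benchmarking idea, a detector's output can be scored against the set of tests the dataset labels as flaky. The Python sketch below computes precision and recall over sets of test names; the function name and the example identifiers used with it are assumptions, not part of the paper.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Score a detector's predicted flaky tests against a labeled ground truth.

    precision = true positives / all predictions
    recall    = true positives / all labeled flaky tests
    Empty denominators yield 0.0 rather than raising.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Note that precision is only meaningful if the candidate pool also contains tests known to be stable; a dataset made up solely of flaky tests supports recall directly but needs a complementary set of non-flaky tests for precision.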

Load-bearing premise

The tests selected from developer reports and previous datasets, together with the accompanying scripts, faithfully reproduce real flaky failures without introducing artificial behaviors or selection biases.

What would settle it

Executing the reproduction scripts on an independent system and verifying that the tests pass and fail intermittently as documented in the logs, while the fix scripts eliminate the flakiness.
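
The check described above amounts to rerunning a test many times and looking at the mix of outcomes. The following Python sketch shows one way to do that under stated assumptions: a run counts as a pass when the command exits with code 0, and the script path mentioned in the comment is a placeholder, not a file the dataset is known to ship.

```python
import subprocess
from typing import Callable, List, Sequence


def script_runner(cmd: Sequence[str]) -> Callable[[], bool]:
    """Wrap a command so each call runs it once; exit code 0 counts as a pass."""
    def run_once() -> bool:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    return run_once


def rerun(run_once: Callable[[], bool], runs: int) -> List[bool]:
    """Invoke run_once the given number of times and collect pass/fail outcomes."""
    return [run_once() for _ in range(runs)]


def classify(outcomes: List[bool]) -> str:
    """Label a series of outcomes as always-pass, always-fail, or flaky."""
    if not outcomes:
        return "no-runs"
    if all(outcomes):
        return "always-pass"
    if not any(outcomes):
        return "always-fail"
    return "flaky"


# Example wiring (placeholder path; substitute the real reproduction command):
#   outcomes = rerun(script_runner(["./reproduce.sh"]), runs=20)
#   print(classify(outcomes))
```

Before the fix is applied, the expected label is `flaky`; after the fix script runs, it should be `always-pass`. A finite number of clean reruns cannot prove that flakiness is gone, so the run count is a judgment call rather than a guarantee.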

Figures

Figures reproduced from arXiv: 2605.21677 by August Shi, Mahbub-Ul-Hoque Sumon, Maruf Morshed Khan, Md Erfan, Suzzana Rafi, Wing Lam.

**Figure 2.** High-level workflow of our methodology. […] with Thread.sleep to reproduce failure), and FixedCodeChange (e.g., Fixed with Thread.sleep to confirm the failure no longer occurs). […] 4 ReproFlake Infrastructure and Artifact. We present ReproFlake [8], which is a dataset of reproducible flaky tests of multiple categories. We curate the dataset with flaky tests from a popular flaky test dataset, iDoFT [63], and issue r…

**Figure 3.** Categories of Dependency Resolution Errors in our Dataset (N=216)

**Figure 4.** Assertion error categories for ID (N=805) and OD (N=122) flaky tests in our dataset.

**Figure 5.** Assertion error categories for TD (N=36) and NIO (N=125) flaky tests in our dataset.

Original abstract

Flaky tests pass and fail non-deterministically when run on the same version of code. Although many techniques have been proposed to detect, debug, and repair flaky tests, reproducing their failures remains a major challenge due to their inherent nondeterminism. Many flaky test datasets exist to help researchers study them, but these datasets are often composed of disjoint sets of flaky tests, where each dataset provides unique information, such as flaky tests of different categories, failure logs of flaky tests, or flaky tests reported by developers vs. flaky tests found by automated tools. In this work, we aim to create a reproducible dataset of flaky tests, curated from both developer issue reports and a popular dataset of flaky tests. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures (e.g., challenges researchers may face when using prior flaky test datasets), the characteristics (e.g., location of the fix and its correlation with the flaky test category), and difficulties researchers may face in using our dataset to collect additional information (e.g., code coverage) about flaky tests. Our findings show that error information helps identify flaky test categories and guide repairs, that unresolved compilation failures highlight challenges in building legacy projects, and knowing typical fix locations can help prioritize repair efforts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ReproFlake bundles executable environments and scripts for 1115 flaky tests in a way prior datasets did not, though reproduction success across the full set needs clearer numbers.

ReproFlake is a dataset paper that packages 1115 flaky tests with full reproducible environments, failure reproduction scripts, automated fix scripts, and logs. This is the main takeaway for anyone working in the area.
The authors curate tests from both developer issue reports and an existing flaky test dataset. They cover four categories and add the practical pieces that prior collections lacked: build environments that let you compile the code, scripts to trigger the flaky behavior on demand, fixes that stabilize the tests, and execution logs for both passing and failing runs. They also provide contribution guidelines so the dataset can grow. In the paper they show example uses, such as analyzing fix locations by category and noting difficulties with legacy project builds.
This approach works well because it tackles the core reproducibility problem head on. Instead of researchers having to hunt down dependencies and figure out how to make tests flake, everything is set up to run. The mix of sources helps avoid over-reliance on one detection method. The analyses they include give some initial insights into the data, like how error information can guide repairs.
The softer part is around verification details. The abstract points out unresolved compilation failures for some legacy projects, which suggests that not every test in the 1115 might reproduce without extra effort. A breakdown of how many succeeded out of the box, or any post-filtering steps, would strengthen the claim that the dataset is fully reproducible as presented. It's also worth checking if the supplied scripts introduce any artifacts that change the original flaky behavior.
This paper is for software engineering researchers focused on flaky tests, especially those developing detection or repair tools who need a shared, executable benchmark. Someone planning experiments would find the artifacts directly useful. It shows honest engagement with the practical barriers in the area.
I would send this to peer review. The contribution is the dataset and its supporting materials, and it fills a documented gap. Reviewers can focus on the curation process and reproduction evidence to confirm the value.

Referee Report

2 major / 2 minor

Summary. The paper presents ReproFlake, a dataset of 1115 reproducible flaky tests across four categories, curated from developer issue reports and an existing flaky-test collection. It supplies Docker-style compilation environments, scripts to reproduce failures, scripts to apply fixes and verify non-flakiness, and passing/failing execution logs, together with contribution guidelines and an analysis of reproduction challenges, fix locations, and category identification via error messages.

Significance. If the reproducibility claims are substantiated, the dataset would provide a materially more usable resource than prior disjoint collections by enabling direct execution, fix verification, and controlled experiments on flaky-test repair. The explicit inclusion of both passing and failing logs plus automated fix scripts is a concrete strength that could support reproducible tool evaluations.

major comments (2)

[Abstract] Abstract and dataset-construction section: the central claim that all 1115 tests are reproducible under the supplied environments and scripts is load-bearing, yet the manuscript does not report a quantified reproduction success rate (e.g., fraction that compile and exhibit nondeterminism on first run of the provided scripts). The abstract's reference to unresolved compilation failures for legacy projects indicates that a non-negligible subset may require manual patches or fail to reproduce the reported flaky behavior, which would undermine the advertised lack of selection bias and full reproducibility.
[Evaluation] Evaluation / usage section: the analysis of challenges in reproducing failures and collecting additional data (e.g., coverage) is presented as a demonstration of the dataset's utility, but without an explicit accounting of how many of the 1115 entries actually succeeded in the authors' own reproduction runs, it is difficult to assess whether the reported characteristics (fix locations, category correlations) are representative of the full curated set or only of the successfully reproduced subset.

minor comments (2)

[Abstract] The four flaky-test categories are referenced repeatedly but never defined or linked to standard taxonomies in the abstract or early sections; a short table or citation would improve readability.
[Tables/Figures] Figure captions and table headers could more explicitly state the exact number of tests per category and the success/failure counts for the reproduction scripts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important opportunities to strengthen the substantiation of our reproducibility claims. We respond to each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract and dataset-construction section: the central claim that all 1115 tests are reproducible under the supplied environments and scripts is load-bearing, yet the manuscript does not report a quantified reproduction success rate (e.g., fraction that compile and exhibit nondeterminism on first run of the provided scripts). The abstract's reference to unresolved compilation failures for legacy projects indicates that a non-negligible subset may require manual patches or fail to reproduce the reported flaky behavior, which would undermine the advertised lack of selection bias and full reproducibility.

Authors: We agree that an explicit quantified reproduction success rate would make the central claim more transparent and easier to evaluate. The 1115 tests were included only after we successfully established Docker-style environments and observed the reported flaky behavior using the supplied scripts. The reference to unresolved compilation failures pertains to a separate analysis of legacy-project challenges encountered outside the core curated set. To address the referee's concern directly, we will revise the abstract for precision and add a dedicated paragraph in the dataset-construction section that reports the success metrics from our own reproduction runs, including the fraction of tests that compiled cleanly and exhibited nondeterminism on the first execution of the provided scripts.
Revision: yes
Referee: [Evaluation] Evaluation / usage section: the analysis of challenges in reproducing failures and collecting additional data (e.g., coverage) is presented as a demonstration of the dataset's utility, but without an explicit accounting of how many of the 1115 entries actually succeeded in the authors' own reproduction runs, it is difficult to assess whether the reported characteristics (fix locations, category correlations) are representative of the full curated set or only of the successfully reproduced subset.

Authors: The analyses of fix locations, category correlations, and reproduction challenges were performed on the full set of 1115 entries, each of which was verified to reproduce under the supplied environments before inclusion. Nevertheless, we acknowledge that stating the exact number of successful author reproductions would help readers judge representativeness. We will add an explicit accounting in the evaluation section, including any cases that required minor manual adjustments for legacy dependencies, so that the scope of the reported characteristics is unambiguous.
Revision: yes
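
The accounting requested in this exchange could be computed from per-test run records along the following lines. This Python sketch is illustrative only: the record fields, and the working definition of "reproduced" as at least one pass and at least one failure across the recorded runs of a test that compiled, are assumptions made here rather than the authors' stated procedure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReproductionResult:
    """Hypothetical summary of the recorded runs for one test."""
    test_name: str
    compiled: bool
    passes: int
    failures: int

    @property
    def reproduced(self) -> bool:
        # Assumed definition: compiled, and both outcomes were observed.
        return self.compiled and self.passes > 0 and self.failures > 0


def summarize(results: List[ReproductionResult]) -> Dict[str, float]:
    """Aggregate compile and reproduction counts and rates over all tests."""
    total = len(results)
    compiled = sum(1 for r in results if r.compiled)
    reproduced = sum(1 for r in results if r.reproduced)
    return {
        "total": total,
        "compiled": compiled,
        "reproduced": reproduced,
        "compile_rate": compiled / total if total else 0.0,
        "reproduction_rate": reproduced / total if total else 0.0,
    }
```

Reporting the rates per category as well as overall would also show whether the reproduced subset is skewed toward any one category.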

Circularity Check

0 steps flagged

No circularity: dataset construction paper with explicit curation steps

This is a dataset paper that curates 1115 flaky tests from developer reports and prior datasets, then supplies Docker-style environments, reproduction scripts, fix scripts, and logs. No equations, predictions, fitted parameters, or first-principles derivations appear anywhere in the abstract or described content. All claims are grounded in the concrete construction process (selection criteria, script execution, and verification that tests are no longer flaky after fixes). The work is self-contained against external benchmarks because reproducibility can be checked by running the provided artifacts on the released dataset; no load-bearing step reduces to a self-citation or to the target result by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard domain assumptions about flaky-test categories and the reliability of developer-reported issues; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Flaky tests fall into four well-defined categories that can be reliably identified from error information.
The paper uses these categories to organize the dataset and findings without providing an independent validation of the taxonomy.
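
The figure captions abbreviate the categories as ID, OD, TD, and NIO. Assuming OD stands for order-dependent, as in the order-dependent flaky-test work listed in the references, the Python sketch below shows what such a test looks like in miniature: its outcome depends on whether another test ran first and left shared state behind. The function names and the shared dictionary are invented for illustration and do not come from the dataset, which targets real project test suites.

```python
from typing import Callable, Dict, List

# Shared mutable state that one check pollutes and another silently relies on.
_cache: Dict[str, str] = {}


def check_populates_cache() -> None:
    """Writes to the shared cache and never cleans up after itself."""
    _cache["user"] = "alice"
    assert _cache["user"] == "alice"


def check_assumes_empty_cache() -> None:
    """Passes only if no earlier check left an entry behind."""
    assert len(_cache) == 0


def run_in_order(checks: List[Callable[[], None]]) -> Dict[str, bool]:
    """Run the checks in the given order from a clean state; map name to pass/fail."""
    _cache.clear()
    results: Dict[str, bool] = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = True
        except AssertionError:
            results[check.__name__] = False
    return results
```

Running `check_assumes_empty_cache` first lets both checks pass, while the reverse order makes it fail. The failure is deterministic for a given order, which is why a reproduction script for this kind of test mainly needs to pin the order.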

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages

[1] 2005. SIR. https://sir.csc.ncsu.edu/portal/index.php. SIR Dataset.
[2] 2014. Defects4J. https://github.com/rjust/defects4j. Defects4J Dataset.
[3] 2019. BugSwarm. https://www.bugswarm.org/dataset. Bug Swarm Dataset.
[4] 2025. Guideline. https://docs.google.com/document/d/1VcORh2Otr7iYqbIXlwVW7ikPnlEV8bOF. Accessed: January 2026.
[5] 2026. . Accessed 2026.
[6] 2026. ReproFlake: A Reproducible Dataset of Flaky Tests. https://sites.google.com/view/reproflakedataset. Accessed: January 2026.
[7] 2026. A machine learning solution for detecting and mitigating flaky tests. https://eng.fitbit.com/a-machine-learning-solution-for-detecting-and-mitigating-flaky-tests
[8] Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, and Yves Le Traon. 2023. FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning. In International Conference on Automation of Software Test.
[9] Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. 2016. AndroZoo: Collecting Millions of Android Apps for the Research Community. In MSR.
[10] Abdulrahman Alshammari, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. FlakeFlagger: Predicting flakiness without rerunning tests. In International Conference on Software Engineering.
[11] Apache Software Foundation. 2025. Apache JIRA Issue Tracker. https://issues.apache.org/jira. Accessed: January 2026.
[12] Apache Software Foundation. 2025. COLLECTIONS-812. https://issues.apache.org/jira/browse/COLLECTIONS-812. Issue report.
[13] Atlassian. 2024. Jira Python Library. https://jira.readthedocs.io/. Accessed: January 2026.
[14] Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang. (2019).
[15] Keila Barbosa, Ronivaldo Ferreira, Gustavo Pinto, Marcelo d’Amorim, and Breno Miranda. 2023. Test Flakiness Across Programming Languages. IEEE Transactions on Software Engineering 49, 4 (2023).
[16] Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In International Conference on Software Engineering.
[17] Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration. In MSR.
[18] Yang Chen and Reyhaneh Jabbarvand. 2024. Neurosymbolic Repair of Test Flakiness. In ISSTA 2024: Proceedings of the 2024 International Symposium on Software Testing and Analysis.
[19] Marco D’Ambros, Michele Lanza, and Romain Robbes. 2010. An Extensive Comparison of Bug Prediction Approaches. In MSR.
[20] 2026. Docker. https://www.docker.com
[21] Moritz Eck, Fabio Palomba, Marco Castelluccio, and Alberto Bacchelli. 2019. Understanding flaky tests: The developer’s perspective. In European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
[22] Emad Fallahzadeh and Peter C. Rigby. 2022. The impact of flaky tests on historical test prioritization on chrome. In ICSE SEIP.
[23] Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. 2023. Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests. Transactions on Software Engineering (2023).
[24] Sakina Fatima, Hadi Hemmati, and Lionel Briand. 2024. FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair. IEEE Transactions on Software Engineering.
[25] 2026. Flakiness dashboard HOWTO. http://www.chromium.org/developers/testing/flakiness-dashboard
[26] 2026. Flaky tests (and how to avoid them). https://engineering.salesforce.com/flaky-tests-and-how-to-avoid-them-25b84b756f60
[27] Phil Gochenour and Rachel Andre. 2026. How to Deal with Flaky Java Tests. https://wiki.saucelabs.com/display/DOCS/How+to+Deal+with+Flaky+Java+Tests
[28] Alex Gyori, Ben Lambeth, August Shi, Owolabi Legunsen, and Darko Marinov. 2016. NonDex: A tool for detecting and debugging wrong assumptions on Java API specifications. In International Symposium on Foundations of Software Engineering (Tool Demonstrations Track).
[29] Sarra Habchi, Guillaume Haben, Jeongju Sohn, Adriano Franci, Mike Papadakis, Maxime Cordy, and Yves Le Traon
[30] What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness. In ICSME.
[31] Brian Harry. 2026. How we approach testing VSTS to enable continuous delivery. https://blogs.msdn.microsoft.com/bharry/2017/06/28/testing-in-a-cloud-delivery-cadence
[32] Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments of the effectiveness of dataflow- and controlflow-based test adequacy criteria. In ICSE.
[33] He Jiang, Xiaochen Li, Zijiang Yang, and Jifeng Xuan. 2017. What causes my test alarm? Automatic cause analysis for test alarms in system and integration testing. In ICSE 2017: Proceedings of the 39th International Conference on Software Engineering.
[34] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs, See [2].
[35] Emily Kowalczyk, Karan Nair, Zebao Gao, Leo Silberstein, Teng Long, and Atif Memon. 2020. Modeling and ranking flaky tests at Apple. In ICSE SEIP 2020: Proceedings of the 42nd International Conference on Software Engineering, Software Engineering in Practice Track.
[36] Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Martin Monperrus, Jacques Klein, and Yves Le Traon
[37] iFixR: bug report driven program repair. In ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
[38] Wing Lam, Patrice Godefroid, Suman Nath, Anirudh Santhiar, and Suresh Thummalapenta. 2019. Root Causing Flaky Tests in a Large-Scale Industrial Setting. In International Symposium on Software Testing and Analysis.
[39] Wing Lam, Kivanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A Study on the Lifecycle of Flaky Tests. In International Conference on Software Engineering.
[40] Wing Lam, Kivanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A study on the lifecycle of flaky tests. In ICSE 2020: Proceedings of the 42nd International Conference on Software Engineering.
[41] Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In ICST 2019: 12th International Conference on Software Testing, Verification and Validation.
[42] Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In International Conference on Software Testing, Verification, and Validation.
[43] Wing Lam, August Shi, Reed Oei, Sai Zhang, Michael D. Ernst, and Tao Xie. 2020. Dependent-Test-Aware Regression Testing Techniques. In International Symposium on Software Testing and Analysis.
[44] Wing Lam, Stefan Winter, Angello Astorga, Victoria Stodden, and Darko Marinov. 2020. Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects. In International Symposium on Software Reliability Engineering.
[45] Chengpeng Li, M. Mahdi Khosravi, Wing Lam, and August Shi. 2023. Systematically producing test-orders to detect order-dependent flaky tests. In ISSTA 2023: Proceedings of the 2023 International Symposium on Software Testing and Analysis.
[46] Chengpeng Li, Chenguang Zhu, Wenxi Wang, and August Shi. 2022. Repairing Order-Dependent Flaky Tests via Test Generation. In ICSE 2022: Proceedings of the 44th International Conference on Software Engineering.
[47] Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. SIGPLAN Not. (2016).
[48] Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. In FSE 2014: Proceedings of the ACM SIGSOFT 22nd Symposium on the Foundations of Software Engineering.
[49] Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An Empirical Analysis of Flaky Tests. In International Symposium on Foundations of Software Engineering.
[50] 2026. Expedia – Fixing flaky time based unit tests. https://medium.com/expedia-group-tech/fixing-flaky-time-based-unit-tests-176accf5096e
[51] Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. 2017. Taming Google-scale continuous testing, See [4].
[52] John Micco. 2026. Continuous integration at Google scale. https://www.slideshare.net/JohnMicco1/2016-0425-continuous-integration-at-google-scale
[53] 2026. Netflix automation talks - Test automation at scale. https://youtu.be/FrBN94gUn_I?t=764
[54] Md Tajmilur Rahman and Peter C. Rigby. 2018. The impact of failing, flaky, and high failure tests on the number of crash reports associated with Firefox builds. In ESEC/FSE 2018: Proceedings of the 2018 12th Joint Meeting on Foundations of Software Engineering.
[55] Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, and Wing Lam. 2025. Ranking Relevant Tests for Order-Dependent Flaky Tests. In ICSE 2025: 47th International Conference on Software Engineering.
[56] Shanto Rahman, Aaron Massey, Wing Lam, August Shi, and Jonathan Bell. 2024. Automatically Reproducing Timing-Dependent Flaky-Test Failures. In International Conference on Software Testing, Verification, and Validation.
[57] Shanto Rahman and August Shi. 2024. FlakeSync: Automatically Repairing Async Flaky Tests. In International Conference on Software Engineering.
[58] August Shi, Alex Gyori, Owolabi Legunsen, and Darko Marinov. 2016. Detecting assumptions on deterministic implementations of non-deterministic specifications. In ICST 2016: 11th International Conference on Software Testing, Verification and Validation.
[59] August Shi, Wing Lam, Reed Oei, Tao Xie, and Darko Marinov. 2019. iFixFlakies: A framework for automatically fixing order-dependent flaky tests. In ESEC/FSE 2019: Proceedings of the 2019 13th Joint Meeting on Foundations of Software Engineering.
[60] Pavan Sudarshan. 2026. No more flaky tests on the Go team. http://www.thoughtworks.com/insights/blog/no-more-flaky-tests-go-team
[61] 2026. Test verification. https://developer.mozilla.org/en-US/docs/Mozilla/QA/Test_Verification
[62] David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In ICSE.
[63] University of Illinois at Urbana-Champaign. 2021. International Dataset of Flaky Tests (IDoFT). http://mir.cs.illinois.edu/flakytests. Accessed: January 2026.
[64] University of Illinois at Urbana-Champaign. 2022. NonDex. https://github.com/TestingResearchIllinois/NonDex. Accessed: January 2026.
[65] Anjiang Wei, Pu Yi, Zhengxi Li, Tao Xie, Darko Marinov, and Wing Lam. 2022. Preempting flaky tests via non-idempotent-outcome tests. In International Conference on Software Engineering.
[66] Eric Wendelin. 2026. Introducing flaky test mitigation tools. https://blog.gradle.org/gradle-flaky-test-retry-plugin
[67] Andreas Zeller. 1999. Yesterday, my program worked. Today, it does not. Why? In ESEC/FSE ’99: Proceedings of the 7th European Software Engineering Conference and the 7th ACM SIGSOFT Symposium on the Foundations of Software Engineering.
[68] Sai Zhang, Darioush Jalali, Jochen Wuttke, Kıvanç Muşlu, Wing Lam, Michael D. Ernst, and David Notkin. 2014. Empirically revisiting the test independence assumption, See [2].
[69] Celal Ziftci and Jim Reardon. 2017. Who broke the build?: Automatically identifying changes that induce test failures in continuous integration at Google scale, See [4].


work page 2010

[20] [20]

Docker 2026. Docker. https://www.docker.com

work page 2026

[21] [21]

Moritz Eck, Fabio Palomba, Marco Castelluccio, and Alberto Bacchelli. 2019. Understanding flaky tests: The developer’s perspective. InEuropean Software Engineering Conference and Symposium on the Foundations of Software Engineering

work page 2019

[22] [22]

Emad Fallahzadeh and Peter C. Rigby. 2022. The impact of flaky tests on historical test prioritization on chrome.. In ICSE SEIP

work page 2022

[23] [23]

Ghaleb, and Lionel Briand

Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. 2023. Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests.Transactions on Software Engineering(2023)

work page 2023

[24] [24]

Sakina Fatima, Hadi Hemmati, and Lionel Briand. 2024. FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair.IEEE Transactions on Software Engineering. , Vol. 1, No. 1, Article . Publication date: May 2026. A Dataset of Reproducible Flaky-Test Failures 19

work page 2024

[25] [25]

Flakiness dashboard HOWTO

Flakiness Dashboard HOWTO 2026. Flakiness dashboard HOWTO. http://www.chromium.org/developers/testing/flakiness- dashboard

work page 2026

[26] [26]

Flaky tests (and how to avoid them)

Flaky tests (and how to avoid them) 2026. Flaky tests (and how to avoid them). https://engineering.salesforce.com/flaky- tests-and-how-to-avoid-them-25b84b756f60

work page 2026

[27] [27]

Phil Gochenour and Rachel Andre. 2026. How to Deal with Flaky Java Tests. https://wiki.saucelabs.com/display/ DOCS/How+to+Deal+with+Flaky+Java+Tests

work page 2026

[28] [28]

Alex Gyori, Ben Lambeth, August Shi, Owolabi Legunsen, and Darko Marinov. 2016. NonDex: A tool for detecting and debugging wrong assumptions on Java API specifications. InInternational Symposium on Foundations of Software Engineering (Tool Demonstrations Track)

work page 2016

[29] [29]

Sarra Habchi, Guillaume Haben, Jeongju Sohn, Adriano Franci, Mike Papadakis, Maxime Cordy, and Yves Le Traon

work page

[30] [30]

What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness. InICSME

work page

[31] [31]

Brian Harry. 2026. How we approach testing VSTS to enable continuous delivery. https://blogs.msdn.microsoft.com/bharry/2017/06/28/testing-in-a-cloud-delivery-cadence

work page 2026

[32] [32]

Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments of the effectiveness of dataflow- and controlflow-based test adequacy criteria. InICSE

work page 1994

[33] [33]

He Jiang, Xiaochen Li, Zijiang Yang, and Jifeng Xuan. 2017. What causes my test alarm? Automatic cause analysis for test alarms in system and integration testing. InICSE 2017: Proceedings of the 39th International Conference on Software Engineering

work page 2017

[34] [34]

René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs, See [2]

work page 2014

[35] [35]

Emily Kowalczyk, Karan Nair, Zebao Gao, Leo Silberstein, Teng Long, and Atif Memon. 2020. Modeling and ranking flaky tests at Apple. InICSE SEIP 2020: Proceedings of the 42nd International Conference on Software Engineering, Software Engineering in Practice Track

work page 2020

[36] [36]

Bissyandé, Dongsun Kim, Martin Monperrus, Jacques Klein, and Yves Le Traon

Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Martin Monperrus, Jacques Klein, and Yves Le Traon

work page

[37] [37]

InESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

iFixR: bug report driven program repair. InESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

work page 2019

[38] [38]

Wing Lam, Patrice Godefroid, Suman Nath, Anirudh Santhiar, and Suresh Thummalapenta. 2019. Root Causing Flaky Tests in a Large-Scale Industrial Setting. InInternational Symposium on Software Testing and Analysis

work page 2019

[39] [39]

Wing Lam, Kivanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A Study on the Lifecycle of Flaky Tests. InInternational Conference on Software Engineering

work page 2020

[40] [40]

Wing Lam, Kivanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A study on the lifecycle of flaky tests. In ICSE 2020: Proceedings of the 42nd International Conference on Software Engineering

work page 2020

[41] [41]

Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. InICST 2019: 12th International Conference on Software Testing, Verification and Validation

work page 2019

[42] [42]

Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. InInternational Conference on Software Testing, Verification, and Validation

work page 2019

[43] [43]

Ernst, and Tao Xie

Wing Lam, August Shi, Reed Oei, Sai Zhang, Michael D. Ernst, and Tao Xie. 2020. Dependent-Test-Aware Regression Testing Techniques. InInternational Symposium on Software Testing and Analysis

work page 2020

[44] [44]

Wing Lam, Stefan Winter, Angello Astorga, Victoria Stodden, and Darko Marinov. 2020. Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects. InInternational Symposium on Software Reliability Engineering

work page 2020

[45] [45]

Mahdi Khosravi, Wing Lam, and August Shi

Chengpeng Li, M. Mahdi Khosravi, Wing Lam, and August Shi. 2023. Systematically producing test-orders to detect order-dependent flaky tests. InISSTA 2023: Proceedings of the 2023 International Symposium on Software Testing and Analysis

work page 2023

[46] [46]

Chengpeng Li, Chenguang Zhu, Wenxi Wang, and August Shi. 2022. Repairing Order-Dependent Flaky Tests via Test Generation. InICSE 2022: Proceedings of the 44th International Conference on Software Engineering

work page 2022

[47] [47]

Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code.SIGPLAN Not.(2016)

work page 2016

[48] [48]

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. InFSE 2014: Proceedings of the ACM SIGSOFT 22nd Symposium on the Foundations of Software Engineering

work page 2014

[49] [49]

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An Empirical Analysis of Flaky Tests. In International Symposium on Foundations of Software Engineering

work page 2014

[50] [50]

Expedia – Fixing flaky time based unit tests

MediumExpedia 2026. Expedia – Fixing flaky time based unit tests. https://medium.com/expedia-group-tech/fixing- flaky-time-based-unit-tests-176accf5096e

work page 2026

[51] [51]

Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. 2017. Taming Google-scale continuous testing, See [4]

work page 2017

[52] [52]

John Micco. 2026. Continuous integration at Google scale. https://www.slideshare.net/JohnMicco1/2016-0425- continuous-integration-at-google-scale. , Vol. 1, No. 1, Article . Publication date: May 2026. 20 Suzzana Rafi, Mahbub-Ul-Hoque Sumon, Md Erfan, Maruf Morshed Khan, August Shi, and Wing Lam

work page 2026

[53] [53]

Netflix automation talks - Test automation at scale

Netflix automation talks - Test automation at scale 2026. Netflix automation talks - Test automation at scale. https://youtu.be/FrBN94gUn_I?t=764

work page 2026

[54] [54]

Md Tajmilur Rahman and Peter C. Rigby. 2018. The impact of failing, flaky, and high failure tests on the number of crash reports associated with Firefox builds. InESEC/FSE 2018: Proceedings of the 2018 12th Joint Meeting on Foundations of Software Engineering

work page 2018

[55] [55]

Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, and Wing Lam. 2025. Ranking Relevant Tests for Order-Dependent Flaky Tests. InICSE 2025: 47th International Conference on Software Engineering

work page 2025

[56] [56]

Shanto Rahman, Aaron Massey, Wing Lam, August Shi, and Jonathan Bell. 2024. Automatically Reproducing Timing- Dependent Flaky-Test Failures. InInternational Conference on Software Testing, Verification, and Validation

work page 2024

[57] [57]

Shanto Rahman and August Shi. 2024. FlakeSync: Automatically Repairing Async Flaky Tests. InInternational Conference on Software Engineering

work page 2024

[58] [58]

August Shi, Alex Gyori, Owolabi Legunsen, and Darko Marinov. 2016. Detecting assumptions on deterministic implementations of non-deterministic specifications. InICST 2016: 11th International Conference on Software Testing, Verification and Validation

work page 2016

[59] [59]

August Shi, Wing Lam, Reed Oei, Tao Xie, and Darko Marinov. 2019. iFixFlakies: A framework for automatically fixing order-dependent flaky tests. InESEC/FSE 2019: Proceedings of the 2019 13th Joint Meeting on Foundations of Software Engineering

work page 2019

[60] [60]

Pavan Sudarshan. 2026. No more flaky tests on the Go team. http://www.thoughtworks.com/insights/blog/no-more- flaky-tests-go-team

work page 2026

[61] [61]

Test verification

Test verification 2026. Test verification. https://developer.mozilla.org/en-US/docs/Mozilla/QA/Test_Verification

work page 2026

[62] [62]

Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T

David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. InICSE

work page 2019

[63] [63]

University of Illinois at Urbana-Champaign. 2021. International Dataset of Flaky Tests (IDoFT). http://mir.cs.illinois. edu/flakytests. Accessed: January 2026

work page 2021

[64] [64]

University of Illinois at Urbana-Champaign. 2022. NonDex. https://github.com/TestingResearchIllinois/NonDex. Accessed: January 2026

work page 2022

[65] [65]

Anjiang Wei, Pu Yi, Zhengxi Li, Tao Xie, Darko Marinov, and Wing Lam. 2022. Preempting flaky tests via non- idempotent-outcome tests. InInternational Conference on Software Engineering

work page 2022

[66] [66]

Eric Wendelin. 2026. Introducing flaky test mitigation tools. https://blog.gradle.org/gradle-flaky-test-retry-plugin

work page 2026

[67] [67]

Andreas Zeller. 1999. Yesterday, my program worked. Today, it does not. Why?. InESEC/FSE ’99: Proceedings of the 7th European Software Engineering Conference and the 7th ACM SIGSOFT Symposium on the Foundations of Software Engineering

work page 1999

[68] [68]

Ernst, and David Notkin

Sai Zhang, Darioush Jalali, Jochen Wuttke, Kıvanç Muşlu, Wing Lam, Michael D. Ernst, and David Notkin. 2014. Empirically revisiting the test independence assumption, See [2]

work page 2014

[69] [69]

Celal Ziftci and Jim Reardon. 2017. Who broke the build?: Automatically identifying changes that induce test failures in continuous integration at Google scale, See [4]. , Vol. 1, No. 1, Article . Publication date: May 2026

work page 2017