
arxiv: 2605.21677 · v1 · pith:SW5TMYPCnew · submitted 2026-05-20 · 💻 cs.SE

A Dataset of Reproducible Flaky-Test Failures

Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords flaky tests · reproducible dataset · software testing · test flakiness · flaky test fixes · execution logs · nondeterministic tests

The pith

The paper presents ReproFlake, a dataset providing reproducible environments and scripts for 1115 flaky tests to study and repair their nondeterministic failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flaky tests fail unpredictably even on unchanged code, making them difficult to debug and fix. The authors create ReproFlake to overcome limitations in existing datasets by including everything needed to reliably reproduce the failures. This includes compilation environments, failure reproduction scripts, fix application scripts that confirm the tests become stable, and detailed execution logs. A sympathetic reader would see this as enabling systematic study of flaky test categories, fix locations, and practical challenges like legacy code builds. The dataset also comes with contribution guidelines to grow the collection over time.

Core claim

We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures, the characteristics such as location of the fix and its correlation with the flaky test category, as …

What carries the argument

ReproFlake, a packaged collection of flaky tests with build environments, reproduction scripts, automated fix scripts, and pass/fail logs that allows consistent reproduction of nondeterministic behavior.

If this is right

  • Error information helps identify flaky test categories and guide repairs.
  • Unresolved compilation failures highlight challenges in building legacy projects.
  • Knowing typical fix locations can help prioritize repair efforts.
  • Guidelines support community contributions to expand the dataset with more reproducible tests.
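
The fix-location point above amounts to a cross-tabulation of flaky test category against where the fix landed. The following is a minimal sketch of that bookkeeping, not the paper's analysis: the records are made up for illustration, the location labels (`test code`, `production code`) are assumptions, and the category labels simply reuse the abbreviations that appear in the figure captions.

```python
from collections import Counter

# Made-up illustrative records: (flaky test category, where the fix landed).
# These are NOT figures from the paper.
records = [
    ("OD", "test code"),
    ("OD", "test code"),
    ("ID", "test code"),
    ("ID", "production code"),
    ("TD", "test code"),
    ("NIO", "test code"),
]


def fix_location_table(rows):
    """Return, per category, the share of fixes found at each location."""
    pair_counts = Counter(rows)
    category_totals = Counter(category for category, _ in rows)
    table = {}
    for (category, location), count in pair_counts.items():
        share = count / category_totals[category]
        table.setdefault(category, {})[location] = share
    return table


table = fix_location_table(records)
for category in sorted(table):
    print(category, table[category])
```

A table like this only suggests where to look first; whether the shares are meaningful depends on how many tests each category actually contains.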

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating this dataset into continuous integration pipelines could help teams detect and address flakiness earlier in development.
  • Extending the reproduction scripts to measure code coverage might reveal patterns in which parts of the code contribute to flakiness.
  • The approach of bundling reproduction and repair tools could apply to other nondeterministic issues in software, such as race conditions.
  • Researchers could use the dataset to benchmark new flaky test detection algorithms against a standardized, reproducible set of examples.

Load-bearing premise

The tests selected from developer reports and previous datasets, together with the accompanying scripts, faithfully reproduce real flaky failures without introducing artificial behaviors or selection biases.

What would settle it

Executing the reproduction scripts on an independent system and verifying that the tests pass and fail intermittently as documented in the logs, while the fix scripts eliminate the flakiness.
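
The check described above can be sketched as a small rerun harness. This is a minimal sketch under stated assumptions, not ReproFlake's actual scripts: `classify_runs`, the simulated tests, the 30% failure probability, and the run count are all hypothetical stand-ins for invoking the dataset's reproduction and fix scripts inside the provided environment.

```python
import random
from collections import Counter
from typing import Callable


def classify_runs(test: Callable[[], bool], runs: int = 50) -> str:
    """Rerun a test and label it from the observed outcomes.

    Returns "flaky" if both passes and failures were seen,
    "stable-pass" if every run passed, and "stable-fail" otherwise.
    """
    outcomes = Counter(test() for _ in range(runs))
    if outcomes[True] and outcomes[False]:
        return "flaky"
    return "stable-pass" if outcomes[True] else "stable-fail"


# Hypothetical stand-ins for one dataset entry: a test that fails roughly
# 30% of the time, and the same test after its fix has been applied.
rng = random.Random(0)  # seeded so the sketch is repeatable


def flaky_test() -> bool:
    return rng.random() >= 0.3


def fixed_test() -> bool:
    return True


before = classify_runs(flaky_test)
after = classify_runs(fixed_test)
print(before, after)
```

A run count that shows no failure does not prove a test is stable; it only bounds how often the failure is likely to occur, so the number of reruns is a design choice that trades time for confidence.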

Figures

Figures reproduced from arXiv: 2605.21677 by August Shi, Mahbub-Ul-Hoque Sumon, Maruf Morshed Khan, Md Erfan, Suzzana Rafi, Wing Lam.

Figure 1: Example of TD flaky and fixed test from apache commons-collections. The red-commented block was …
Figure 2: High-level workflow of our methodology.
Excerpt captured with the caption: … with Thread.sleep to reproduce failure), and FixedCodeChange (e.g., Fixed with Thread.sleep to confirm the failure no longer occurs). Section 4, ReproFlake Infrastructure and Artifact: We present ReproFlake [8], which is a dataset of reproducible flaky tests of multiple categories. We curate the dataset with flaky tests from a popular flaky test dataset, iDoFT [63], and issue r…
Figure 3: Categories of Dependency Resolution Errors in our Dataset (N=216)
Figure 4: Assertion error categories for ID (N=805) and OD (N=122) flaky tests in our dataset.
Figure 5: Assertion error categories for TD (36) and NIO (125) flaky tests in our dataset.
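
The Figure 2 text mentions using Thread.sleep both to reproduce a failure and to confirm that a fix removes it. A rough Python analogue of that idea is sketched below, using `time.sleep` in place of Java's Thread.sleep. The `run_test` helper and the delay values are hypothetical and chosen only for illustration; the dataset's own tests and scripts may work quite differently.

```python
import threading
import time
from typing import Optional


def run_test(worker_delay: float, wait_timeout: Optional[float]) -> bool:
    """Return True if the simulated test passes.

    A worker thread publishes a result after `worker_delay` seconds. The
    test waits up to `wait_timeout` seconds (None means wait until the
    worker finishes) and then checks the result.
    """
    result = []

    def worker():
        time.sleep(worker_delay)  # injected delay that widens the race window
        result.append(42)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join(timeout=wait_timeout)  # too short a timeout loses the race
    passed = result == [42]
    thread.join()  # let the worker finish before returning
    return passed


# Injecting a delay longer than the wait makes the timing failure repeatable.
reproduced = run_test(worker_delay=0.2, wait_timeout=0.05)

# A fix that waits for the worker passes even with the same injected delay.
fixed = run_test(worker_delay=0.2, wait_timeout=None)

print(reproduced, fixed)
```

The injected delay turns an occasional race into a repeatable failure, and rerunning with the same delay after the fix gives some evidence that the fix addresses the timing dependency rather than merely hiding it.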
Original abstract

Flaky tests pass and fail non-deterministically when run on the same version of code. Although many techniques have been proposed to detect, debug, and repair flaky tests, reproducing their failures remains a major challenge due to their inherent nondeterminism. Many flaky test datasets exist to help researchers study them, but these datasets are often composed of disjoint sets of flaky tests, where each dataset provides unique information, such as flaky tests of different categories, failure logs of flaky tests, or flaky tests reported by developers vs. flaky tests found by automated tools. In this work, we aim to create a reproducible dataset of flaky tests, curated from both developer issue reports and a popular dataset of flaky tests. Compared to prior flaky test datasets, our dataset is the first to provide (1) a reproducible environment to compile flaky tests, (2) scripts to reproduce failures, (3) scripts to automatically apply flaky test fixes and ensure that the tests are no longer flaky, and (4) execution logs of flaky test passing and failing. We present ReproFlake, a dataset of 1115 reproducible flaky tests across four flaky test categories. We create guidelines to help others contribute to this reproducible dataset, and demonstrate how to use our dataset to understand challenges in reproducing flaky test failures (e.g., challenges researchers may face when using prior flaky test datasets), the characteristics (e.g., location of the fix and its correlation with the flaky test category), and difficulties researchers may face in using our dataset to collect additional information (e.g., code coverage) about flaky tests. Our findings show that error information helps identify flaky test categories and guide repairs, that unresolved compilation failures highlight challenges in building legacy projects, and knowing typical fix locations can help prioritize repair efforts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ReproFlake, a dataset of 1115 reproducible flaky tests across four categories, curated from developer issue reports and an existing flaky-test collection. It supplies Docker-style compilation environments, scripts to reproduce failures, scripts to apply fixes and verify non-flakiness, and passing/failing execution logs, together with contribution guidelines and an analysis of reproduction challenges, fix locations, and category identification via error messages.

Significance. If the reproducibility claims are substantiated, the dataset would provide a materially more usable resource than prior disjoint collections by enabling direct execution, fix verification, and controlled experiments on flaky-test repair. The explicit inclusion of both passing and failing logs plus automated fix scripts is a concrete strength that could support reproducible tool evaluations.

major comments (2)
  1. [Abstract] Abstract and dataset-construction section: the central claim that all 1115 tests are reproducible under the supplied environments and scripts is load-bearing, yet the manuscript does not report a quantified reproduction success rate (e.g., fraction that compile and exhibit nondeterminism on first run of the provided scripts). The abstract's reference to unresolved compilation failures for legacy projects indicates that a non-negligible subset may require manual patches or fail to reproduce the reported flaky behavior, which would undermine the advertised lack of selection bias and full reproducibility.
  2. [Evaluation] Evaluation / usage section: the analysis of challenges in reproducing failures and collecting additional data (e.g., coverage) is presented as a demonstration of the dataset's utility, but without an explicit accounting of how many of the 1115 entries actually succeeded in the authors' own reproduction runs, it is difficult to assess whether the reported characteristics (fix locations, category correlations) are representative of the full curated set or only of the successfully reproduced subset.
minor comments (2)
  1. [Abstract] The four flaky-test categories are referenced repeatedly but never defined or linked to standard taxonomies in the abstract or early sections; a short table or citation would improve readability.
  2. [Tables/Figures] Figure captions and table headers could more explicitly state the exact number of tests per category and the success/failure counts for the reproduction scripts.
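
The reproduction success rate requested in the major comments could be reported with bookkeeping along these lines. This is a sketch under assumptions: the `ReproAttempt` record and its fields are hypothetical, and the sample attempts are made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReproAttempt:
    """Outcome of running a (hypothetical) reproduction script for one test."""
    test_id: str
    compiled: bool   # did the project build in the provided environment?
    saw_pass: bool   # at least one passing run was observed
    saw_fail: bool   # at least one failing run was observed


def success_rates(attempts: List[ReproAttempt]) -> Dict[str, float]:
    """Fractions of entries that compiled and that showed both outcomes."""
    total = len(attempts)
    if total == 0:
        return {"compile_rate": 0.0, "reproduce_rate": 0.0}
    compiled = [a for a in attempts if a.compiled]
    reproduced = [a for a in compiled if a.saw_pass and a.saw_fail]
    return {
        "compile_rate": len(compiled) / total,
        "reproduce_rate": len(reproduced) / total,
    }


# Made-up attempts for illustration only; not results from the paper.
attempts = [
    ReproAttempt("t1", compiled=True, saw_pass=True, saw_fail=True),
    ReproAttempt("t2", compiled=True, saw_pass=True, saw_fail=False),
    ReproAttempt("t3", compiled=False, saw_pass=False, saw_fail=False),
    ReproAttempt("t4", compiled=True, saw_pass=True, saw_fail=True),
]
rates = success_rates(attempts)
print(rates)
```

Reporting both fractions over the full curated set, rather than only over the entries that compiled, would make it clear whether later analyses describe all entries or only the reproduced subset.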

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important opportunities to strengthen the substantiation of our reproducibility claims. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract and dataset-construction section: the central claim that all 1115 tests are reproducible under the supplied environments and scripts is load-bearing, yet the manuscript does not report a quantified reproduction success rate (e.g., fraction that compile and exhibit nondeterminism on first run of the provided scripts). The abstract's reference to unresolved compilation failures for legacy projects indicates that a non-negligible subset may require manual patches or fail to reproduce the reported flaky behavior, which would undermine the advertised lack of selection bias and full reproducibility.

    Authors: We agree that an explicit quantified reproduction success rate would make the central claim more transparent and easier to evaluate. The 1115 tests were included only after we successfully established Docker-style environments and observed the reported flaky behavior using the supplied scripts. The reference to unresolved compilation failures pertains to a separate analysis of legacy-project challenges encountered outside the core curated set. To address the referee's concern directly, we will revise the abstract for precision and add a dedicated paragraph in the dataset-construction section that reports the success metrics from our own reproduction runs, including the fraction of tests that compiled cleanly and exhibited nondeterminism on the first execution of the provided scripts. revision: yes

  2. Referee: [Evaluation] Evaluation / usage section: the analysis of challenges in reproducing failures and collecting additional data (e.g., coverage) is presented as a demonstration of the dataset's utility, but without an explicit accounting of how many of the 1115 entries actually succeeded in the authors' own reproduction runs, it is difficult to assess whether the reported characteristics (fix locations, category correlations) are representative of the full curated set or only of the successfully reproduced subset.

    Authors: The analyses of fix locations, category correlations, and reproduction challenges were performed on the full set of 1115 entries, each of which was verified to reproduce under the supplied environments before inclusion. Nevertheless, we acknowledge that stating the exact number of successful author reproductions would help readers judge representativeness. We will add an explicit accounting in the evaluation section, including any cases that required minor manual adjustments for legacy dependencies, so that the scope of the reported characteristics is unambiguous. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset construction paper with explicit curation steps

full rationale

This is a dataset paper that curates 1115 flaky tests from developer reports and prior datasets, then supplies Docker-style environments, reproduction scripts, fix scripts, and logs. No equations, predictions, fitted parameters, or first-principles derivations appear anywhere in the abstract or described content. All claims are grounded in the concrete construction process (selection criteria, script execution, and verification that tests are no longer flaky after fixes). The work is self-contained against external benchmarks because reproducibility can be checked by running the provided artifacts on the released dataset; no load-bearing step reduces to a self-citation or to the target result by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard domain assumptions about flaky-test categories and the reliability of developer-reported issues; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Flaky tests fall into four well-defined categories that can be reliably identified from error information.
    The paper uses these categories to organize the dataset and findings without providing an independent validation of the taxonomy.



Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages

  1. [1]

    2005. SIR. https://sir.csc.ncsu.edu/portal/index.php. SIR Dataset

  2. [2]

    Defects4J

    2014. Defects4J. https://github.com/rjust/defects4j. Defects4J Dataset

  3. [3]

    BugSwarm

    2019. BugSwarm. https://www.bugswarm.org/dataset. Bug Swarm Dataset

  4. [4]

    Guideline

    2025. Guideline. https://docs.google.com/document/d/1VcORh2Otr7iYqbIXlwVW7ikPnlEV8bOF. Accessed: January 2026

  5. [5]

    Accessed 2026

    2026. . Accessed 2026

  6. [6]

    ReproFlake: A Reproducible Dataset of Flaky Tests

    2026. ReproFlake: A Reproducible Dataset of Flaky Tests. https://sites.google.com/view/reproflakedataset. Accessed: January 2026

  7. [7]

    A machine learning solution for detecting and mitigating flaky tests

    2026. A machine learning solution for detecting and mitigating flaky tests. https://eng.fitbit.com/a-machine-learning-solution-for-detecting-and-mitigating-flaky-tests

  8. [8]

    Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, and Yves Le Traon. 2023. FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning. In International Conference on Automation of Software Test

  9. [9]

    Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. 2016. AndroZoo: Collecting Millions of Android Apps for the Research Community. In MSR

  10. [10]

    Abdulrahman Alshammari, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. FlakeFlagger: Predicting flakiness without rerunning tests. In International Conference on Software Engineering

  11. [11]

    Apache Software Foundation. 2025. Apache JIRA Issue Tracker. https://issues.apache.org/jira. Accessed: January 2026

  12. [12]

    Apache Software Foundation. 2025. COLLECTIONS-812. https://issues.apache.org/jira/browse/COLLECTIONS-812. Issue report

  13. [13]

    Atlassian. 2024. Jira Python Library. https://jira.readthedocs.io/. Accessed: January 2026

  14. [14]

    Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang. (2019)

  15. [15]

    Keila Barbosa, Ronivaldo Ferreira, Gustavo Pinto, Marcelo d’Amorim, and Breno Miranda. 2023. Test Flakiness Across Programming Languages. IEEE Transactions on Software Engineering 49, 4 (2023)

  16. [16]

    Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In International Conference on Software Engineering

  17. [17]

    Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration. In MSR

  18. [18]

    Yang Chen and Reyhaneh Jabbarvand. 2024. Neurosymbolic Repair of Test Flakiness. In ISSTA 2024: Proceedings of the 2024 International Symposium on Software Testing and Analysis

  19. [19]

    Marco D’Ambros, Michele Lanza, and Romain Robbes. 2010. An Extensive Comparison of Bug Prediction Approaches. In MSR

  20. [20]

    2026. Docker. https://www.docker.com

  21. [21]

    Moritz Eck, Fabio Palomba, Marco Castelluccio, and Alberto Bacchelli. 2019. Understanding flaky tests: The developer’s perspective. In European Software Engineering Conference and Symposium on the Foundations of Software Engineering

  22. [22]

    Emad Fallahzadeh and Peter C. Rigby. 2022. The impact of flaky tests on historical test prioritization on chrome. In ICSE SEIP

  23. [23]

    Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. 2023. Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests. Transactions on Software Engineering (2023)

  24. [24]

    Sakina Fatima, Hadi Hemmati, and Lionel Briand. 2024. FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair.IEEE Transactions on Software Engineering. , Vol. 1, No. 1, Article . Publication date: May 2026. A Dataset of Reproducible Flaky-Test Failures 19

  25. [25]

    Flakiness dashboard HOWTO

    2026. Flakiness dashboard HOWTO. http://www.chromium.org/developers/testing/flakiness-dashboard

  26. [26]

    Flaky tests (and how to avoid them)

    2026. Flaky tests (and how to avoid them). https://engineering.salesforce.com/flaky-tests-and-how-to-avoid-them-25b84b756f60

  27. [27]

    Phil Gochenour and Rachel Andre. 2026. How to Deal with Flaky Java Tests. https://wiki.saucelabs.com/display/DOCS/How+to+Deal+with+Flaky+Java+Tests

  28. [28]

    Alex Gyori, Ben Lambeth, August Shi, Owolabi Legunsen, and Darko Marinov. 2016. NonDex: A tool for detecting and debugging wrong assumptions on Java API specifications. In International Symposium on Foundations of Software Engineering (Tool Demonstrations Track)

  29. [29]

    Sarra Habchi, Guillaume Haben, Jeongju Sohn, Adriano Franci, Mike Papadakis, Maxime Cordy, and Yves Le Traon

  30. [30]

    What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness. In ICSME

  31. [31]

    Brian Harry. 2026. How we approach testing VSTS to enable continuous delivery. https://blogs.msdn.microsoft.com/bharry/2017/06/28/testing-in-a-cloud-delivery-cadence

  32. [32]

    Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments of the effectiveness of dataflow- and controlflow-based test adequacy criteria. In ICSE

  33. [33]

    He Jiang, Xiaochen Li, Zijiang Yang, and Jifeng Xuan. 2017. What causes my test alarm? Automatic cause analysis for test alarms in system and integration testing. In ICSE 2017: Proceedings of the 39th International Conference on Software Engineering

  34. [34]

    René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs, See [2]

  35. [35]

    Emily Kowalczyk, Karan Nair, Zebao Gao, Leo Silberstein, Teng Long, and Atif Memon. 2020. Modeling and ranking flaky tests at Apple. In ICSE SEIP 2020: Proceedings of the 42nd International Conference on Software Engineering, Software Engineering in Practice Track

  36. [36]

    Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Martin Monperrus, Jacques Klein, and Yves Le Traon

  37. [37]

    iFixR: bug report driven program repair. In ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

  38. [38]

    Wing Lam, Patrice Godefroid, Suman Nath, Anirudh Santhiar, and Suresh Thummalapenta. 2019. Root Causing Flaky Tests in a Large-Scale Industrial Setting. In International Symposium on Software Testing and Analysis

  39. [39]

    Wing Lam, Kivanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A Study on the Lifecycle of Flaky Tests. In International Conference on Software Engineering

  40. [40]

    Wing Lam, Kivanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A study on the lifecycle of flaky tests. In ICSE 2020: Proceedings of the 42nd International Conference on Software Engineering

  41. [41]

    Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In ICST 2019: 12th International Conference on Software Testing, Verification and Validation

  42. [42]

    Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In International Conference on Software Testing, Verification, and Validation

  43. [43]

    Wing Lam, August Shi, Reed Oei, Sai Zhang, Michael D. Ernst, and Tao Xie. 2020. Dependent-Test-Aware Regression Testing Techniques. In International Symposium on Software Testing and Analysis

  44. [44]

    Wing Lam, Stefan Winter, Angello Astorga, Victoria Stodden, and Darko Marinov. 2020. Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects. In International Symposium on Software Reliability Engineering

  45. [45]

    Chengpeng Li, M. Mahdi Khosravi, Wing Lam, and August Shi. 2023. Systematically producing test-orders to detect order-dependent flaky tests. In ISSTA 2023: Proceedings of the 2023 International Symposium on Software Testing and Analysis

  46. [46]

    Chengpeng Li, Chenguang Zhu, Wenxi Wang, and August Shi. 2022. Repairing Order-Dependent Flaky Tests via Test Generation. In ICSE 2022: Proceedings of the 44th International Conference on Software Engineering

  47. [47]

    Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. SIGPLAN Not. (2016)

  48. [48]

    Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. In FSE 2014: Proceedings of the ACM SIGSOFT 22nd Symposium on the Foundations of Software Engineering

  49. [49]

    Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An Empirical Analysis of Flaky Tests. In International Symposium on Foundations of Software Engineering

  50. [50]

    Expedia – Fixing flaky time based unit tests

    2026. Expedia – Fixing flaky time based unit tests. https://medium.com/expedia-group-tech/fixing-flaky-time-based-unit-tests-176accf5096e

  51. [51]

    Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. 2017. Taming Google-scale continuous testing, See [4]

  52. [52]

    John Micco. 2026. Continuous integration at Google scale. https://www.slideshare.net/JohnMicco1/2016-0425-continuous-integration-at-google-scale

  53. [53]

    Netflix automation talks - Test automation at scale

    2026. Netflix automation talks - Test automation at scale. https://youtu.be/FrBN94gUn_I?t=764

  54. [54]

    Md Tajmilur Rahman and Peter C. Rigby. 2018. The impact of failing, flaky, and high failure tests on the number of crash reports associated with Firefox builds. In ESEC/FSE 2018: Proceedings of the 2018 12th Joint Meeting on Foundations of Software Engineering

  55. [55]

    Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, and Wing Lam. 2025. Ranking Relevant Tests for Order-Dependent Flaky Tests. In ICSE 2025: 47th International Conference on Software Engineering

  56. [56]

    Shanto Rahman, Aaron Massey, Wing Lam, August Shi, and Jonathan Bell. 2024. Automatically Reproducing Timing-Dependent Flaky-Test Failures. In International Conference on Software Testing, Verification, and Validation

  57. [57]

    Shanto Rahman and August Shi. 2024. FlakeSync: Automatically Repairing Async Flaky Tests. In International Conference on Software Engineering

  58. [58]

    August Shi, Alex Gyori, Owolabi Legunsen, and Darko Marinov. 2016. Detecting assumptions on deterministic implementations of non-deterministic specifications. In ICST 2016: 11th International Conference on Software Testing, Verification and Validation

  59. [59]

    August Shi, Wing Lam, Reed Oei, Tao Xie, and Darko Marinov. 2019. iFixFlakies: A framework for automatically fixing order-dependent flaky tests. In ESEC/FSE 2019: Proceedings of the 2019 13th Joint Meeting on Foundations of Software Engineering

  60. [60]

    Pavan Sudarshan. 2026. No more flaky tests on the Go team. http://www.thoughtworks.com/insights/blog/no-more-flaky-tests-go-team

  61. [61]

    Test verification

    2026. Test verification. https://developer.mozilla.org/en-US/docs/Mozilla/QA/Test_Verification

  62. [62]

    David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In ICSE

  63. [63]

    University of Illinois at Urbana-Champaign. 2021. International Dataset of Flaky Tests (IDoFT). http://mir.cs.illinois.edu/flakytests. Accessed: January 2026

  64. [64]

    University of Illinois at Urbana-Champaign. 2022. NonDex. https://github.com/TestingResearchIllinois/NonDex. Accessed: January 2026

  65. [65]

    Anjiang Wei, Pu Yi, Zhengxi Li, Tao Xie, Darko Marinov, and Wing Lam. 2022. Preempting flaky tests via non-idempotent-outcome tests. In International Conference on Software Engineering

  66. [66]

    Eric Wendelin. 2026. Introducing flaky test mitigation tools. https://blog.gradle.org/gradle-flaky-test-retry-plugin

  67. [67]

    Andreas Zeller. 1999. Yesterday, my program worked. Today, it does not. Why?. In ESEC/FSE ’99: Proceedings of the 7th European Software Engineering Conference and the 7th ACM SIGSOFT Symposium on the Foundations of Software Engineering

  68. [68]

    Sai Zhang, Darioush Jalali, Jochen Wuttke, Kıvanç Muşlu, Wing Lam, Michael D. Ernst, and David Notkin. 2014. Empirically revisiting the test independence assumption, See [2]

  69. [69]

    Celal Ziftci and Jim Reardon. 2017. Who broke the build?: Automatically identifying changes that induce test failures in continuous integration at Google scale, See [4].