Understanding Flaky Tests: The Developer's Perspective
Pith reviewed 2026-05-25 10:50 UTC · model grok-4.3
The pith
Flaky tests stem from multiple causes, four of them new and the costliest to fix, with reproduction and cause identification as the main developer challenges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through the classifications provided by 21 professional developers for 200 flaky tests they previously fixed and an online survey of 121 developers with a median of five years of industrial experience, the study shows that the flakiness is due to several different causes, four of which have never been reported before despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause for the flakiness.
What carries the argument
Developer classifications of flaky tests by nature of flakiness, origin, and fixing effort, together with survey responses on perceptions and reported challenges.
If this is right
- Four previously unreported causes of flakiness require the greatest fixing effort and therefore warrant targeted attention.
- Flakiness influences resource allocation and scheduling decisions in development projects of any size or domain.
- Support for reproducing flaky behavior would directly address the challenge developers identify most often.
- Improved methods for identifying the cause of flakiness would reduce the main reported difficulty in handling these tests.
Where Pith is reading between the lines
- Tools that automate reproduction steps could reduce the time developers spend on the most common challenge.
- Awareness of the newly identified causes could be built into test maintenance practices to lower overall costs.
- Surveying additional developers on the same questions might confirm whether the reported cost rankings hold beyond the initial sample.
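As an illustration of what automated reproduction support might look like, here is a minimal Python sketch that reruns a test many times against unchanged code and reports whether its outcome varies. It is not taken from the paper: the names `rerun`, `make_flaky_test`, and `stable_test` are hypothetical, and the flakiness is simulated with a seeded random number generator rather than a real cause such as concurrency or network access.

```python
import random
from typing import Callable, Dict, Union


def rerun(test: Callable[[], None], runs: int = 100) -> Dict[str, Union[int, bool]]:
    """Run `test` repeatedly on the same code and count assertion failures."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    # A test is reported as flaky only if it both passed and failed at least once.
    return {"runs": runs, "failures": failures, "flaky": 0 < failures < runs}


def make_flaky_test(seed: int, fail_rate: float = 0.3) -> Callable[[], None]:
    """Build a hypothetical test that fails intermittently (simulated cause)."""
    rng = random.Random(seed)  # seeded so the demonstration is repeatable

    def test() -> None:
        assert rng.random() >= fail_rate, "simulated intermittent failure"

    return test


def stable_test() -> None:
    """A deterministic test used as a baseline."""
    assert 1 + 1 == 2


if __name__ == "__main__":
    print(rerun(make_flaky_test(seed=0)))
    print(rerun(stable_test))
```

Rerunning only shows that an outcome varies; it does not identify the cause, which is the second challenge the surveyed developers report.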
Load-bearing premise
The classifications provided by the 21 developers who fixed the 200 tests, and the self-reported perceptions from the 121 survey respondents, accurately reflect the true underlying causes and costs without substantial recall bias or social-desirability effects.
What would settle it
An independent review of the same 200 tests that assigns different primary causes or different effort rankings than the ones supplied by the developers who fixed them.
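One way such a comparison could be quantified is with Cohen's kappa, a standard chance-corrected agreement measure. The sketch below is an assumption about how an independent review might be scored, not a procedure described in the paper; the function name `cohens_kappa` and the category labels "A" and "B" are placeholders, not the paper's actual categories.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    developer_labels = ["A", "A", "B", "B"]  # placeholder categories
    reviewer_labels = ["A", "B", "A", "B"]   # placeholder categories
    print(cohens_kappa(developer_labels, reviewer_labels))
```

A kappa near 1 would indicate that an independent reviewer largely reproduces the developers' cause assignments, while a value near 0 would indicate agreement no better than chance.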
Original abstract
Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and has proposed automated solutions to locate as well as fix flaky tests--we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon. We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature of the flakiness, the origin of the flakiness, and the fixing effort. We complement this analysis with information about the fixing strategy. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: The flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause for the flakiness. Data and materials [https://doi.org/10.5281/zenodo.3265785].
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from an empirical study in which 21 professional developers classified 200 flaky tests they had previously fixed (by nature of flakiness, origin, fixing effort, and strategy), together with an online survey of 121 developers (median of five years of experience) on perceptions of significance, effects, and challenges. Key claims are that four previously unreported causes are the most costly, that flakiness is viewed as significant by the vast majority of developers independent of team size or domain, and that reproduction of flaky behavior and cause identification are the dominant challenges.
Significance. If the classifications and self-reports hold, the work supplies developer-centric evidence that complements prior automated-detection papers, surfaces four new causes with cost implications, and identifies actionable pain points around reproduction and diagnosis. The public release of data and materials is a positive contribution to reproducibility.
major comments (2)
- [Study of 200 flaky tests (classification procedure and results)] The identification of four new causes and the claim that they are the most costly rest entirely on the 21 developers' post-hoc self-classification of the 200 tests; no independent verification (code review, execution logs, or third-party diagnosis) is described to confirm that the assigned categories match actual root causes. This directly supports the novelty and cost-ranking results.
- [Survey results and discussion of perceptions/challenges] Claims that flakiness is perceived as significant by the vast majority and that reproduction/identification are the primary challenges derive solely from the 121 survey responses without cross-validation against project artifacts or behavioral data, leaving the findings open to recall bias or social-desirability effects.
minor comments (1)
- [Abstract and §1] The abstract and introduction could state the sample sizes and the self-report nature of the data earlier to set expectations.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive comments. Our work is explicitly framed as an investigation of the developer's perspective (title, abstract, and research questions), so the study design relies on self-classification and self-reported perceptions. We address each major comment below and will incorporate clarifications on limitations.
- Referee: [Study of 200 flaky tests (classification procedure and results)] The identification of four new causes and the claim that they are the most costly rest entirely on the 21 developers' post-hoc self-classification of the 200 tests; no independent verification (code review, execution logs, or third-party diagnosis) is described to confirm that the assigned categories match actual root causes. This directly supports the novelty and cost-ranking results.
  Authors: The classifications were intentionally collected from the 21 developers who had fixed the tests, as the paper's goal is to surface how developers themselves categorize causes, effort, and strategies rather than to establish objective ground truth via external verification. This matches the stated focus on the developer's perspective and complements prior automated-detection work. We agree that the absence of independent verification is a limitation for claims about actual root causes; we will add an explicit subsection on threats to validity addressing self-reported classifications and their implications for the novelty and cost results.
  Revision: partial
- Referee: [Survey results and discussion of perceptions/challenges] Claims that flakiness is perceived as significant by the vast majority and that reproduction/identification are the primary challenges derive solely from the 121 survey responses without cross-validation against project artifacts or behavioral data, leaving the findings open to recall bias or social-desirability effects.
  Authors: The survey component is designed to capture developers' perceptions of significance, effects, and challenges, which is the intended contribution. Standard survey methodology in empirical software engineering relies on self-reports for such questions; cross-validation against artifacts would address a different research goal. We acknowledge the potential for recall and social-desirability bias and will expand the threats-to-validity discussion to cover these issues explicitly while retaining the perceptual findings as reported.
  Revision: partial
Circularity Check
No circularity: purely empirical survey and classification study with no derivations or fitted predictions
The paper reports results from developer classifications of 200 tests and a survey of 121 respondents. No equations, models, parameters, or predictions are derived from prior fitted quantities; the claims are direct summaries of collected responses. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central findings. The study is self-contained as an empirical data collection effort.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Developers' classifications of flaky-test causes and fixing effort accurately capture reality without substantial bias.
- Domain assumption: The 121 survey respondents are representative of professional developers who encounter flaky tests.