
arxiv: 1907.01466 · v1 · pith:KPAHG5WAnew · submitted 2019-07-02 · 💻 cs.SE

Understanding Flaky Tests: The Developer's Perspective

Pith reviewed 2026-05-25 10:50 UTC · model grok-4.3

classification 💻 cs.SE
keywords flaky tests · software testing · developer perceptions · test flakiness · causes of flakiness · fixing effort · survey study · classification of tests

The pith

Flaky tests stem from multiple causes, four of them new and the costliest to fix, with reproduction and cause identification as the main developer challenges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the nature of flaky tests through classifications made by 21 developers on 200 tests they had fixed, covering the type of flakiness, its origin, and the effort required, along with a survey of 121 developers on their views and difficulties. It shows that flakiness arises from several different causes, four of which had not been documented before yet demand the highest fixing effort. Developers across team sizes and project domains view flakiness as a meaningful issue that affects how resources are used, how work is scheduled, and how reliable the test suite appears. The chief problems reported are making the inconsistent behavior occur again and determining what produces it. This view matters because it identifies where current understanding of test unreliability falls short and where practical help for developers would be most useful.

Core claim

Through the classifications provided by 21 professional developers for 200 flaky tests they previously fixed and an online survey of 121 developers with a median of five years of industrial experience, the study shows that flakiness is due to several different causes, four of which have never been reported before despite being the most costly to fix; that flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and can affect resource allocation, scheduling, and the perceived reliability of the test suite; and that the challenges developers report mostly concern the reproduction of the flaky behavior and the identification of the cause of the flakiness.

What carries the argument

Developer classifications of flaky tests by nature of flakiness, origin, and fixing effort, together with survey responses on perceptions and reported challenges.

If this is right

  • Four previously unreported causes of flakiness require the greatest fixing effort and therefore warrant targeted attention.
  • Flakiness influences resource allocation and scheduling decisions in development projects of any size or domain.
  • Support for reproducing flaky behavior would directly address the challenge developers identify most often.
  • Improved methods for identifying the cause of flakiness would reduce the main reported difficulty in handling these tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tools that automate reproduction steps could reduce the time developers spend on the most common challenge.
  • Awareness of the newly identified causes could be built into test maintenance practices to lower overall costs.
  • Surveying additional developers on the same questions might confirm whether the reported cost rankings hold beyond the initial sample.

Load-bearing premise

The classifications provided by the 21 developers who fixed the 200 tests, and the self-reported perceptions from the 121 survey respondents, accurately reflect the true underlying causes and costs without substantial recall bias or social-desirability effects.

What would settle it

An independent review of the same 200 tests that assigns different primary causes or different effort rankings than the ones supplied by the developers who fixed them.
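One way to quantify such a comparison is an inter-rater agreement statistic. The sketch below computes Cohen's kappa between the developers' labels and an independent reviewer's labels. It is an illustration only: the paper does not report this analysis, and the function name and the placeholder labels (`cause_A`, `cause_B`) are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Returns 1.0 for perfect agreement and roughly 0.0 when agreement is
    no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Share of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four tests (placeholders, not data from the study).
developer_labels = ["cause_A", "cause_A", "cause_B", "cause_B"]
reviewer_labels = ["cause_A", "cause_B", "cause_A", "cause_B"]
kappa = cohens_kappa(developer_labels, reviewer_labels)
```

A kappa close to 1 would indicate that the independent review reproduces the developers' categories; a value near 0 would indicate agreement at chance level.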

Figures

Figures reproduced from arXiv: 1907.01466 by Alberto Bacchelli, Fabio Palomba, Marco Castelluccio, Moritz Eck.

Figure 1: Frequency and relevance of the problem according …
Original abstract

Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and has proposed automated solutions to locate as well as fix flaky tests--we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon. We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature of the flakiness, the origin of the flakiness, and the fixing effort. We complement this analysis with information about the fixing strategy. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: The flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause for the flakiness. Data and materials [https://doi.org/10.5281/zenodo.3265785].
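The definition in the abstract (a test whose outcome varies when run against identical code) can be illustrated with a minimal sketch. Everything here is hypothetical: `make_flaky_test` simulates hidden nondeterminism with a seeded random generator, standing in for a real cause such as a race or a timeout, and the rerun-based check is a generic heuristic, not a technique taken from the paper.

```python
import random


def make_flaky_test(rng, failure_rate=0.3):
    """Return a hypothetical test whose outcome depends on hidden nondeterminism."""
    def test():
        # Stand-in for a race condition, timeout, or ordering dependency.
        return rng.random() >= failure_rate  # True = pass, False = fail
    return test


def rerun(test, runs=50):
    """Run the same test repeatedly against unchanged code and collect outcomes."""
    return [test() for _ in range(runs)]


def is_flaky(outcomes):
    """A test looks flaky if it both passed and failed on identical code."""
    return len(set(outcomes)) > 1


rng = random.Random(0)  # fixed seed so the illustration itself is repeatable
flaky_outcomes = rerun(make_flaky_test(rng))
stable_outcomes = rerun(lambda: 1 + 1 == 2)
```

Rerunning only exposes flakiness that happens to manifest within the chosen number of runs, which is consistent with developers naming reproduction as a main difficulty.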

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports results from an empirical study in which 21 professional developers classified 200 flaky tests they had previously fixed (by nature of flakiness, origin, fixing effort, and strategy) together with an online survey of 121 developers (median 5 years experience) on perceptions of significance, effects, and challenges. Key claims are that four previously unreported causes are the most costly, that flakiness is viewed as significant by the vast majority of developers independent of team size or domain, and that reproduction of flaky behavior plus cause identification are the dominant challenges.

Significance. If the classifications and self-reports hold, the work supplies developer-centric evidence that complements prior automated-detection papers, surfaces four new causes with cost implications, and identifies actionable pain points around reproduction and diagnosis. The public release of data and materials is a positive contribution to reproducibility.

major comments (2)
  1. [Study of 200 flaky tests (classification procedure and results)] The identification of four new causes and the claim that they are the most costly rest entirely on the 21 developers' post-hoc self-classification of the 200 tests; no independent verification (code review, execution logs, or third-party diagnosis) is described to confirm that the assigned categories match actual root causes. This gap bears directly on the novelty and cost-ranking results.
  2. [Survey results and discussion of perceptions/challenges] Claims that flakiness is perceived as significant by the vast majority and that reproduction/identification are the primary challenges derive solely from the 121 survey responses without cross-validation against project artifacts or behavioral data, leaving the findings open to recall bias or social-desirability effects.
minor comments (1)
  1. [Abstract and §1] The abstract and introduction could state the sample sizes and the self-report nature of the data earlier to set expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. Our work is explicitly framed as an investigation of the developer's perspective (title, abstract, and research questions), so the study design relies on self-classification and self-reported perceptions. We address each major comment below and will incorporate clarifications on limitations.

  1. Referee: [Study of 200 flaky tests (classification procedure and results)] The identification of four new causes and the claim that they are the most costly rest entirely on the 21 developers' post-hoc self-classification of the 200 tests; no independent verification (code review, execution logs, or third-party diagnosis) is described to confirm that the assigned categories match actual root causes. This gap bears directly on the novelty and cost-ranking results.

    Authors: The classifications were intentionally collected from the 21 developers who had fixed the tests, as the paper's goal is to surface how developers themselves categorize causes, effort, and strategies rather than to establish objective ground truth via external verification. This matches the stated focus on the developer's perspective and complements prior automated-detection work. We agree that the absence of independent verification is a limitation for claims about actual root causes; we will add an explicit subsection on threats to validity addressing self-reported classifications and their implications for the novelty and cost results.

    Revision: partial

  2. Referee: [Survey results and discussion of perceptions/challenges] Claims that flakiness is perceived as significant by the vast majority and that reproduction/identification are the primary challenges derive solely from the 121 survey responses without cross-validation against project artifacts or behavioral data, leaving the findings open to recall bias or social-desirability effects.

    Authors: The survey component is designed to capture developers' perceptions of significance, effects, and challenges, which is the intended contribution. Standard survey methodology in empirical software engineering relies on self-reports for such questions; cross-validation against artifacts would address a different research goal. We acknowledge the potential for recall and social-desirability bias and will expand the threats-to-validity discussion to cover these issues explicitly while retaining the perceptual findings as reported.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical survey and classification study with no derivations or fitted predictions


The paper reports results from developer classifications of 200 tests and a survey of 121 respondents. No equations, models, parameters, or predictions are derived from prior fitted quantities; the claims are direct summaries of collected responses. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central findings. The study is self-contained as an empirical data collection effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that developer self-reports and classifications are sufficiently accurate and representative; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Developers' classifications of flaky-test causes and fixing effort accurately capture reality without substantial bias.
    The study design uses these classifications as the primary data source for identifying new causes and cost rankings.
  • domain assumption: The 121 survey respondents are representative of professional developers who encounter flaky tests.
    Broad claims about perception of significance rely on this sample.


