Understanding Flaky Tests: The Developer's Perspective

Moritz Eck; Fabio Palomba; Marco Castelluccio; Alberto Bacchelli

arxiv: 1907.01466 · v1 · pith:KPAHG5WAnew · submitted 2019-07-02 · 💻 cs.SE

Pith reviewed 2026-05-25 10:50 UTC · model grok-4.3

keywords flaky tests · software testing · developer perceptions · test flakiness · causes of flakiness · fixing effort · survey study · classification of tests

The pith

Flaky tests stem from multiple causes, four of them new and the costliest to fix, with reproduction and cause identification as the main developer challenges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the nature of flaky tests through classifications made by 21 developers on 200 tests they had fixed, covering the type of flakiness, its origin, and the effort required, along with a survey of 121 developers on their views and difficulties. It shows that flakiness arises from several different causes, four of which had not been documented before yet demand the highest fixing effort. Developers across team sizes and project domains view flakiness as a meaningful issue that affects how resources are used, how work is scheduled, and how reliable the test suite appears. The chief problems reported are making the inconsistent behavior occur again and determining what produces it. This view matters because it identifies where current understanding of test unreliability falls short and where practical help for developers would be most useful.

Core claim

Drawing on classifications that 21 professional developers provided for 200 flaky tests they had previously fixed, and on an online survey of 121 developers with a median of five years of industrial experience, the study reports three findings. First, flakiness has several different causes, four of which had never been reported before despite being the most costly to fix. Second, the vast majority of developers perceive flakiness as significant, regardless of team size and project domain, and it can affect resource allocation, scheduling, and the perceived reliability of the test suite. Third, the challenges developers report concern mostly the reproduction of the flaky behavior and the identification of its cause.

What carries the argument

Developer classifications of flaky tests by nature of flakiness, origin, and fixing effort, together with survey responses on perceptions and reported challenges.
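As a rough illustration of how classifications along these three dimensions could be tabulated, here is a minimal Python sketch. The `Classification` schema, the 1-5 effort scale, the `origin` values, and the `cause_a`/`cause_b`/`cause_c` labels are all assumptions made for this example; they are not the paper's taxonomy or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Classification:
    """One developer's classification of one fixed flaky test (hypothetical schema)."""
    test_id: str
    nature: str   # kind of flakiness (placeholder labels, not the paper's categories)
    origin: str   # where the flakiness comes from (assumed values)
    effort: int   # self-reported fixing effort on an assumed 1-5 ordinal scale


def median_effort_by_nature(records):
    """Group records by nature and return (nature, median effort) pairs, costliest first."""
    efforts = defaultdict(list)
    for record in records:
        efforts[record.nature].append(record.effort)
    ranked = [(nature, median(values)) for nature, values in efforts.items()]
    # Sort by descending median effort, then by name so ties are deterministic.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Placeholder data for illustration only; not taken from the study.
records = [
    Classification("t1", "cause_a", "test", 2),
    Classification("t2", "cause_a", "test", 3),
    Classification("t3", "cause_b", "production", 5),
    Classification("t4", "cause_b", "test", 4),
    Classification("t5", "cause_c", "test", 1),
]
ranking = median_effort_by_nature(records)
print(ranking)
```

The median is used here because self-reported effort is ordinal; any ranking built this way inherits the self-report limitations discussed further down the page.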

If this is right

Four previously unreported causes of flakiness require the greatest fixing effort and therefore warrant targeted attention.
Flakiness influences resource allocation and scheduling decisions in development projects of any size or domain.
Support for reproducing flaky behavior would directly address the challenge developers identify most often.
Improved methods for identifying the cause of flakiness would reduce the main reported difficulty in handling these tests.
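The reproduction challenge named above can be illustrated with a simple rerun harness. The following is a minimal Python sketch, not a tool from the paper: `estimate_failure_rate` and `make_flaky_test` are hypothetical names, and the seeded random check merely stands in for whatever nondeterminism a real test hides.

```python
import random


def estimate_failure_rate(test_fn, runs=200):
    """Run test_fn repeatedly and return the fraction of runs that raise AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures / runs


def make_flaky_test(seed, failure_probability=0.1):
    """Build a stand-in flaky test; a seeded RNG keeps this demo repeatable."""
    rng = random.Random(seed)

    def test():
        # Stand-in for a nondeterministic condition (assumption for illustration).
        assert rng.random() >= failure_probability

    return test


rate = estimate_failure_rate(make_flaky_test(seed=1234), runs=500)
print(f"observed failure rate: {rate:.3f}")
```

Rerunning only estimates how often a failure shows up; it does not explain why, which is the second challenge the respondents describe.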

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tools that automate reproduction steps could reduce the time developers spend on the most common challenge.
Awareness of the newly identified causes could be built into test maintenance practices to lower overall costs.
Surveying additional developers on the same questions might confirm whether the reported cost rankings hold beyond the initial sample.

Load-bearing premise

The classifications provided by the 21 developers who fixed the 200 tests, and the self-reported perceptions from the 121 survey respondents, accurately reflect the true underlying causes and costs without substantial recall bias or social-desirability effects.

What would settle it

An independent review of the same 200 tests that assigns different primary causes or different effort rankings than the ones supplied by the developers who fixed them.
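One way such an independent review could be compared against the developers' labels is a chance-corrected agreement statistic. The sketch below computes Cohen's kappa in plain Python; the paper does not report this analysis, and the label lists are placeholders rather than study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items with nominal categories."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Placeholder labels: what the fixing developer said vs. a hypothetical reviewer.
developer = ["cause_a", "cause_a", "cause_b", "cause_b"]
reviewer = ["cause_a", "cause_b", "cause_b", "cause_b"]
kappa = cohens_kappa(developer, reviewer)
print(kappa)
```

A kappa near 1 would indicate that the independent review largely reproduces the developers' categories, while a value near 0 would indicate agreement no better than chance.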

Original abstract

Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and has proposed automated solutions to locate as well as fix flaky tests--we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon. We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature of the flakiness, the origin of the flakiness, and the fixing effort. We complement this analysis with information about the fixing strategy. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: The flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause for the flakiness. Data and materials [https://doi.org/10.5281/zenodo.3265785].
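To make the abstract's definition concrete, the following Python sketch shows one well-known way identical code can pass or fail: a check that silently depends on state left behind by another check. The example is illustrative only; the function names and the shared `_cache` are hypothetical, and it is not presented as one of the causes the paper catalogs.

```python
# Module-level state shared between checks (the hidden dependency).
_cache = {}


def check_writes_cache():
    _cache["user"] = "alice"
    assert _cache["user"] == "alice"


def check_reads_cache():
    # Hidden assumption: some earlier check has already filled the cache.
    assert _cache.get("user") == "alice"


def run_in_order(checks):
    """Run checks from a clean cache and record pass/fail per check name."""
    _cache.clear()
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = "pass"
        except AssertionError:
            results[check.__name__] = "fail"
    return results


forward = run_in_order([check_writes_cache, check_reads_cache])
reverse = run_in_order([check_reads_cache, check_writes_cache])
print(forward)
print(reverse)
```

Neither function changes between the two runs; only the execution order does, so a runner that shuffles or parallelizes tests would report `check_reads_cache` as passing on some runs and failing on others.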

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports results from an empirical study in which 21 professional developers classified 200 flaky tests they had previously fixed (by nature of flakiness, origin, fixing effort, and strategy) together with an online survey of 121 developers (median 5 years experience) on perceptions of significance, effects, and challenges. Key claims are that four previously unreported causes are the most costly, that flakiness is viewed as significant by the vast majority of developers independent of team size or domain, and that reproduction of flaky behavior plus cause identification are the dominant challenges.

Significance. If the classifications and self-reports hold, the work supplies developer-centric evidence that complements prior automated-detection papers, surfaces four new causes with cost implications, and identifies actionable pain points around reproduction and diagnosis. The public release of data and materials is a positive contribution to reproducibility.

major comments (2)

[Study of 200 flaky tests (classification procedure and results)] The identification of four new causes and the claim that they are the most costly rest entirely on the 21 developers' post-hoc self-classification of the 200 tests; no independent verification (code review, execution logs, or third-party diagnosis) is described to confirm that the assigned categories match actual root causes. The novelty and cost-ranking results depend directly on this step.
[Survey results and discussion of perceptions/challenges] Claims that flakiness is perceived as significant by the vast majority and that reproduction/identification are the primary challenges derive solely from the 121 survey responses without cross-validation against project artifacts or behavioral data, leaving the findings open to recall bias or social-desirability effects.

minor comments (1)

[Abstract and §1] The abstract and introduction could state the sample sizes and the self-report nature of the data earlier to set expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. Our work is explicitly framed as an investigation of the developer's perspective (title, abstract, and research questions), so the study design relies on self-classification and self-reported perceptions. We address each major comment below and will incorporate clarifications on limitations.

Point-by-point responses

Referee: [Study of 200 flaky tests (classification procedure and results)] The identification of four new causes and the claim that they are the most costly rest entirely on the 21 developers' post-hoc self-classification of the 200 tests; no independent verification (code review, execution logs, or third-party diagnosis) is described to confirm that the assigned categories match actual root causes. The novelty and cost-ranking results depend directly on this step.

Authors: The classifications were intentionally collected from the 21 developers who had fixed the tests, as the paper's goal is to surface how developers themselves categorize causes, effort, and strategies rather than to establish objective ground truth via external verification. This matches the stated focus on the developer's perspective and complements prior automated-detection work. We agree that the absence of independent verification is a limitation for claims about actual root causes; we will add an explicit subsection on threats to validity addressing self-reported classifications and their implications for the novelty and cost results.
revision: partial

Referee: [Survey results and discussion of perceptions/challenges] Claims that flakiness is perceived as significant by the vast majority and that reproduction/identification are the primary challenges derive solely from the 121 survey responses without cross-validation against project artifacts or behavioral data, leaving the findings open to recall bias or social-desirability effects.

Authors: The survey component is designed to capture developers' perceptions of significance, effects, and challenges, which is the intended contribution. Standard survey methodology in empirical software engineering relies on self-reports for such questions; cross-validation against artifacts would address a different research goal. We acknowledge the potential for recall and social-desirability bias and will expand the threats-to-validity discussion to cover these issues explicitly while retaining the perceptual findings as reported.
revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical survey and classification study with no derivations or fitted predictions

Full rationale

The paper reports results from developer classifications of 200 tests and a survey of 121 respondents. No equations, models, parameters, or predictions are derived from prior fitted quantities; the claims are direct summaries of collected responses. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central findings. The study is self-contained as an empirical data collection effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that developer self-reports and classifications are sufficiently accurate and representative; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Developers' classifications of flaky-test causes and fixing effort accurately capture reality without substantial bias.
    The study design uses these classifications as the primary data source for identifying new causes and cost rankings.
domain assumption: The 121 survey respondents are representative of professional developers who encounter flaky tests.
    Broad claims about perception of significance rely on this sample.

