Continuous Integration Theater
Pith reviewed 2026-05-25 10:29 UTC · model grok-4.3
The pith
Many TravisCI projects use continuous integration without following its core practices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By inspecting 1,270 open-source projects that use TravisCI, we quantitatively studied how common it is to use CI with infrequent commits, in projects with poor test coverage, with builds that stay broken for long periods, and with builds that take too long to run. We observed that 748 (~60%) projects face infrequent commits, 85% have at least one broken build that takes more than four days to be fixed, and for the majority the build is executed under the 10 minutes rule of thumb.
What carries the argument
Continuous Integration Theater, the situation in which software engineers do not employ CI tools effectively, leading to unhealthy practices.
Load-bearing premise
That the 1,270 TravisCI projects represent typical CI usage and that the chosen cutoffs for infrequent commits, long-broken builds, and long build times validly mark unhealthy practices.
What would settle it
A replication study on a different set of projects or with different thresholds showing substantially lower rates of infrequent commits and long-broken builds.
Figures
read the original abstract
Background: Continuous Integration (CI) systems are now the bedrock of several software development practices. Several tools such as TravisCI, CircleCI, and Hudson, that implement CI practices, are commonly adopted by software engineers. However, the way that software engineers use these tools could lead to what we call "Continuous Integration Theater", a situation in which software engineers do not employ these tools effectively, leading to unhealthy CI practices. Aims: The goal of this paper is to make sense of how commonplace are these unhealthy continuous integration practices being employed in practice. Method: By inspecting 1,270 open-source projects that use TravisCI, the most used CI service, we quantitatively studied how common is to use CI (1) with infrequent commits, (2) in a software project with poor test coverage, (3) with builds that stay broken for long periods, and (4) with builds that take too long to run. Results: We observed that 748 ($sim$60%) projects face infrequent commits, which essentially makes the merging process harder. Moreover, we were able to find code coverage information for 51 projects. The average code coverage was 78%, although Ruby projects have a higher code coverage than Java projects (86% and 63%, respectively). However, some projects with very small coverage ($sim$4%) were found. Still, we observed that 85% of the studied projects have at least one broken build that take more than four days to be fixed. Interestingly, very small projects (up to 1,000 lines of code) are the ones that take the longest to fix broken builds. Finally, we noted that, for the majority of the studied projects, the build is executed under the 10 minutes rule of thumb. Conclusions: Our results are important to an increasing community of software engineers that employ CI practices on daily basis but may not be aware of bad practices that are eventually employed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the prevalence of four unhealthy CI practices ('Continuous Integration Theater') across 1,270 open-source projects using TravisCI: infrequent commits (748 projects, ~60%), poor test coverage (data for 51 projects, average 78%), builds remaining broken for more than four days (85% of projects), and builds exceeding 10 minutes (minority of projects). It concludes these practices are common and warrant attention from the CI community.
Significance. If the prevalence estimates prove robust, the work supplies concrete observational counts from a sizable TravisCI sample that document gaps between CI tool adoption and effective usage. This could usefully inform practitioner guidelines and CI platform design. The explicit project counts and breakdown by language (e.g., Ruby vs. Java coverage) are strengths.
major comments (4)
- [Abstract / Results] Abstract and Results: the 60% infrequent-commits figure and the 85% long-broken-build figure rest on three un-derived cutoffs (commit frequency, four-day broken-build window, ten-minute build duration) with no sensitivity analysis or alternative thresholds reported; modest changes to any cutoff could shift the headline percentages substantially.
- [Results] Results (coverage paragraph): coverage data exist for only 51 of 1,270 projects; the reported averages and language comparisons therefore rest on a small, possibly non-representative subset and should be qualified accordingly.
- [Method] Method: the sample is drawn exclusively from TravisCI users, yet no discussion addresses whether this introduces selection bias toward more CI-aware projects, limiting claims about the broader population of CI users.
- [Results] Results: prevalence estimates are given as point values with no error bars, confidence intervals, or statistical tests; this weakens the quantitative claims even for the chosen thresholds.
minor comments (2)
- [Abstract] Abstract: the phrase '10 minutes rule of thumb' appears without prior definition; the main text should state the exact rule and its provenance.
- [Conclusions] Conclusions: the final paragraph could more explicitly restate the data limitations (small coverage subsample, TravisCI-only sample) alongside the prevalence numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important limitations in our presentation of results. We agree that the manuscript would benefit from additional analysis and qualifications. We will revise the paper to incorporate sensitivity analyses, better qualification of the coverage subsample, discussion of selection bias, and uncertainty measures for prevalence estimates.
read point-by-point responses
-
Referee: [Abstract / Results] Abstract and Results: the 60% infrequent-commits figure and the 85% long-broken-build figure rest on three un-derived cutoffs (commit frequency, four-day broken-build window, ten-minute build duration) with no sensitivity analysis or alternative thresholds reported; modest changes to any cutoff could shift the headline percentages substantially.
Authors: We acknowledge that the chosen thresholds (infrequent commits, four-day broken builds, and ten-minute builds) are presented without sensitivity analysis. While the ten-minute threshold is described as a 'rule of thumb' in the manuscript and the four-day window is motivated by prior work on build breakage, we agree that robustness should be demonstrated. In the revision we will add a sensitivity analysis subsection that varies each threshold and reports how the headline percentages change. revision: yes
-
Referee: [Results] Results (coverage paragraph): coverage data exist for only 51 of 1,270 projects; the reported averages and language comparisons therefore rest on a small, possibly non-representative subset and should be qualified accordingly.
Authors: The referee correctly notes the small sample (n=51) for coverage. We will revise the results and discussion sections to explicitly qualify this subsample as potentially non-representative, state the limitation prominently, and avoid over-generalizing the language-specific comparisons. revision: yes
-
Referee: [Method] Method: the sample is drawn exclusively from TravisCI users, yet no discussion addresses whether this introduces selection bias toward more CI-aware projects, limiting claims about the broader population of CI users.
Authors: We agree that restricting the sample to TravisCI projects may introduce selection bias. The revised manuscript will include an explicit limitations paragraph discussing this issue and its implications for generalizability beyond TravisCI users. revision: yes
-
Referee: [Results] Results: prevalence estimates are given as point values with no error bars, confidence intervals, or statistical tests; this weakens the quantitative claims even for the chosen thresholds.
Authors: We accept that point estimates alone are insufficient. The revision will add binomial confidence intervals for the main prevalence figures (60% and 85%) and, where feasible, for the coverage statistics. We will also note the absence of formal hypothesis tests as a limitation of the observational design. revision: yes
Circularity Check
No circularity; purely observational counts from explicit thresholds
full rationale
The paper reports direct observational statistics (e.g., 748 projects with infrequent commits, 85% with broken builds >4 days) obtained by applying chosen cutoffs to the 1,270-project TravisCI dataset. No equations, fitted parameters, predictions, or derivations appear. No self-citations, uniqueness theorems, or ansatzes are invoked to support the central claims. The results are computed counts from the data under the stated definitions; they do not reduce to the inputs by construction. Threshold arbitrariness is a validity concern, not a circularity issue per the defined patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- broken-build duration threshold =
4 days
- build duration threshold =
10 minutes
axioms (2)
- domain assumption TravisCI usage is a valid proxy for CI adoption and the selected projects represent broader CI practice.
- domain assumption Public coverage reports from 51 projects are sufficient to characterize test quality across the sample.
Reference graph
Works this paper leans on
-
[1]
Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation
Jez Humble and David Farley. Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation . Addison- Wesley Professional, 1st edition, 2010
work page 2010
-
[2]
Frederick P. Brooks, Jr. The Mythical Man-Month: Essays on Softw . Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1978
work page 1978
-
[3]
Work practices and challenges in continuous integration: A survey with travis CI users
Gustavo Pinto, Fernando Castor, Rodrigo Bonif ´acio, and Marcel Rebouc ¸as. Work practices and challenges in continuous integration: A survey with travis CI users. Softw., Pract. Exper ., 48(12):2223–2236, 2018
work page 2018
-
[4]
M. Rebouc ¸as, R. O. Santos, G. Pinto, and F. Castor. How does contributors’ involvement influence the build status of an open-source software project? In Proceedings of the 14th International Conference on Mining Software Repositories , MSR ’17, pages 475–478, Piscataway, NJ, USA, 2017. IEEE Press
work page 2017
-
[5]
B. Vasilescu, Y . Yu, H. Wang, P. Devanbu, and V . Filkov. Quality and productivity outcomes relating to continuous integration in github. In Proceedings of the 2015 10th Joint Meeting on F oundations of Software Engineering, ESEC/FSE 2015, pages 805–816, 2015
work page 2015
-
[6]
B. Vasilescu, S. van Schuylenburg, J. Wulms, A. Serebrenik, and M. G. J. van den Brand. Continuous integration in a social-coding world: Empiri- cal evidence from github. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution , ICSME ’14, pages 401–405, Washington, DC, USA, 2014. IEEE Computer Society
work page 2014
- [7]
-
[8]
Building a collaborative culture: a grounded theory of well succeeded devops adoption in practice
Welder Pinheiro Luz, Gustavo Pinto, and Rodrigo Bonif ´acio. Building a collaborative culture: a grounded theory of well succeeded devops adoption in practice. In Proceedings of the 12th ACM/IEEE Interna- tional Symposium on Empirical Software Engineering and Measurement, ESEM 2018, Oulu, Finland, October 11-12, 2018 , pages 6:1–6:10, 2018
work page 2018
-
[9]
Martin Fowler. Continuous integration. https://www.martinfowler.com/ articles/continuousIntegration.html. Accessed: 2019-06-23
work page 2019
-
[10]
One size does not fit all: an empirical study of containerized continuous deployment workflows
Yang Zhang, Bogdan Vasilescu, Huaimin Wang, and Vladimir Filkov. One size does not fit all: an empirical study of containerized continuous deployment workflows. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the F oundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, ...
work page 2018
-
[11]
Exploring scrumbutan empirical study of scrum anti-patterns
Veli-Pekka Eloranta, Kai Koskimies, and Tommi Mikkonen. Exploring scrumbutan empirical study of scrum anti-patterns. Information and Software Technology, 74:194 – 203, 2016
work page 2016
-
[12]
Trade-offs in continuous integration: assurance, security, and flexibility
Michael Hilton, Nicholas Nelson, Timothy Tunnell, Darko Marinov, and Danny Dig. Trade-offs in continuous integration: assurance, security, and flexibility. In Proceedings of the 2017 11th Joint Meeting on F oundations of Software Engineering , pages 197–207. ACM, 2017
work page 2017
-
[13]
P. Ammann and J. Offutt and. Coverage criteria for logical expressions. In 14th International Symposium on Software Reliability Engineering,
- [14]
-
[15]
Comparing non-adequate test suites using coverage criteria
Milos Gligoric, Alex Groce, Chaoqiang Zhang, Rohan Sharma, Moham- mad Amin Alipour, and Darko Marinov. Comparing non-adequate test suites using coverage criteria. In Proceedings of the 2013 International Symposium on Software Testing and Analysis , ISSTA 2013, pages 302– 313, 2013
work page 2013
-
[16]
Coverage criteria for testing of object interactions in sequence diagrams
Atanas Rountev, Scott Kagan, and Jason Sawin. Coverage criteria for testing of object interactions in sequence diagrams. In Maura Cerioli, editor, Fundamental Approaches to Software Engineering , pages 289– 304, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg
work page 2005
-
[17]
Travistorrent: synthesizing travis CI and github for full-stack research on continuous integration
Moritz Beller, Georgios Gousios, and Andy Zaidman. Travistorrent: synthesizing travis CI and github for full-stack research on continuous integration. In Proceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, Buenos Aires, Argentina, May 20-28, 2017 , pages 447–450, 2017
work page 2017
-
[18]
A large-scale study of test coverage evolution
Michael Hilton, Jonathan Bell, and Darko Marinov. A large-scale study of test coverage evolution. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering , ASE 2018, pages 53–63, 2018
work page 2018
-
[19]
Evaluating and improving semistructured merge
Guilherme Cavalcanti, Paulo Borba, and Paola Accioly. Evaluating and improving semistructured merge. Proc. ACM Program. Lang. , 1(OOPSLA):59:1–59:27, October 2017
work page 2017
-
[20]
Guilherme Avelino, Leonardo Teixeira Passos, Andr ´e C. Hora, and Marco Tulio Valente. A novel approach for estimating truck factors. In 24th IEEE International Conference on Program Comprehension, ICPC 2016, Austin, TX, USA, May 16-17, 2016 , pages 1–10, 2016
work page 2016
- [21]
-
[22]
An empirical study of the long duration of continuous integration builds
Taher Ahmed Ghaleb, Daniel Alencar da Costa, and Ying Zou. An empirical study of the long duration of continuous integration builds. Empirical Software Engineering , pages 1–38, 2019
work page 2019
-
[23]
Studying the impact of adopting continuous integration on the delivery time of pull requests
Jo ˜ao Helis Bernardo, Daniel Alencar da Costa, and Uir ´a Kulesza. Studying the impact of adopting continuous integration on the delivery time of pull requests. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) , pages 131–141. IEEE, 2018
work page 2018
-
[24]
Yangyang Zhao, Alexander Serebrenik, Yuming Zhou, Vladimir Filkov, and Bogdan Vasilescu. The impact of continuous integration on other software development practices: a large-scale empirical study. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 60–71. IEEE, 2017
work page 2017
-
[25]
Test activities in the continuous integration and delivery pipeline
Torvald M ˚artensson, Daniel St ˚ahl, and Jan Bosch. Test activities in the continuous integration and delivery pipeline. Journal of Software: Evolution and Process , page e2153, 2019
work page 2019
-
[26]
Noise and heterogeneity in historical build data: an empirical study of travis ci
Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIn- tosh. Noise and heterogeneity in historical build data: an empirical study of travis ci. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering , pages 87–97. ACM, 2018
work page 2018
-
[27]
A study on the interplay between pull request review and continuous integration builds
Fiorella Zampetti, Gabriele Bavota, Gerardo Canfora, and Massimiliano Di Penta. A study on the interplay between pull request review and continuous integration builds. In 2019 IEEE 26th International Con- ference on Software Analysis, Evolution and Reengineering (SANER) , pages 38–48. IEEE, 2019
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.