
arxiv: 1907.01602 · v1 · pith:QDFMP4UFnew · submitted 2019-07-02 · 💻 cs.SE

Continuous Integration Theater

Pith reviewed 2026-05-25 10:29 UTC · model grok-4.3

classification 💻 cs.SE
keywords continuous integration · travisci · open source projects · build failures · code coverage · unhealthy practices · ci theater · infrequent commits

The pith

Many TravisCI projects use continuous integration without following its core practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes 1,270 open-source projects using TravisCI to identify unhealthy CI practices. It reports that roughly 60 percent of projects make infrequent commits, which complicates merging. Additionally, 85 percent of projects have at least one build that remains broken for more than four days. Code coverage averages 78 percent in the projects where it could be measured, though some have very low coverage. The authors conclude that these patterns indicate 'Continuous Integration Theater,' where tools are adopted but not used effectively.

Core claim

By inspecting 1,270 open-source projects that use TravisCI, we quantitatively studied how common it is to use CI with infrequent commits, in projects with poor test coverage, with builds that stay broken for long periods, and with builds that take too long to run. We observed that 748 (~60%) projects face infrequent commits, 85% have at least one broken build that takes more than four days to be fixed, and for the majority the build is executed under the 10 minutes rule of thumb.

What carries the argument

Continuous Integration Theater, the situation in which software engineers do not employ CI tools effectively, leading to unhealthy practices.

Load-bearing premise

That the 1,270 TravisCI projects represent typical CI usage and that the chosen cutoffs for infrequent commits, long-broken builds, and long build times validly mark unhealthy practices.
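To make the "long-broken build" cutoff concrete, here is a minimal sketch of how such a flag could be computed from a project's build history. It is an illustration, not the paper's pipeline: the `(timestamp, passed)` record shape, the function names, and the sample `history` are all assumptions, and only the four-day threshold comes from the paper.

```python
from datetime import datetime, timedelta


def longest_broken_stretch(builds):
    """Return the longest time a build stayed broken.

    `builds` is an iterable of (timestamp, passed) pairs (hypothetical shape).
    A broken stretch runs from the first failing build to the next passing one;
    a stretch that is still open at the end of the history is ignored.
    """
    longest = timedelta(0)
    broken_since = None
    for when, passed in sorted(builds):
        if not passed and broken_since is None:
            broken_since = when  # build just turned red
        elif passed and broken_since is not None:
            longest = max(longest, when - broken_since)  # build fixed
            broken_since = None
    return longest


def has_long_broken_build(builds, threshold=timedelta(days=4)):
    """Flag a project whose longest broken stretch exceeds the threshold."""
    return longest_broken_stretch(builds) > threshold


# Made-up history: broken from Jan 2 to Jan 8, then briefly on Jan 9.
history = [
    (datetime(2019, 1, 1), True),
    (datetime(2019, 1, 2), False),
    (datetime(2019, 1, 5), False),
    (datetime(2019, 1, 8), True),
    (datetime(2019, 1, 9), False),
    (datetime(2019, 1, 10), True),
]

print(longest_broken_stretch(history))
print(has_long_broken_build(history))
```

Because the threshold is an ordinary parameter, the same function can be re-run with other cutoffs, which is what the premise above hinges on.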

What would settle it

A replication study on a different set of projects or with different thresholds showing substantially lower rates of infrequent commits and long-broken builds.

Figures

Figures reproduced from arXiv: 1907.01602 by Bruno Cartaxo, Daniel da Costa, Gustavo Pinto, Leonardo Furtado, Wagner Felidré.

Figure 1: The impact of applying each filter on the quantitative …
Figure 2: Size of the project and its frequency per day of the …
Figure 3: Frequency of commits, grouped by the size of the projects (boxplots), and the programming languages (Ruby on the …
Figure 4: Code coverage per programming language
Figure 5: Distribution of days with broken build, grouped by the size of the projects (boxplots), and the programming languages …
Figure 6: Build duration, grouped by the size of the projects (boxplots), and the programming languages (Ruby on the left and …
Original abstract

Background: Continuous Integration (CI) systems are now the bedrock of several software development practices. Several tools such as TravisCI, CircleCI, and Hudson, that implement CI practices, are commonly adopted by software engineers. However, the way that software engineers use these tools could lead to what we call "Continuous Integration Theater", a situation in which software engineers do not employ these tools effectively, leading to unhealthy CI practices. Aims: The goal of this paper is to make sense of how commonplace are these unhealthy continuous integration practices being employed in practice. Method: By inspecting 1,270 open-source projects that use TravisCI, the most used CI service, we quantitatively studied how common is to use CI (1) with infrequent commits, (2) in a software project with poor test coverage, (3) with builds that stay broken for long periods, and (4) with builds that take too long to run. Results: We observed that 748 (~60%) projects face infrequent commits, which essentially makes the merging process harder. Moreover, we were able to find code coverage information for 51 projects. The average code coverage was 78%, although Ruby projects have a higher code coverage than Java projects (86% and 63%, respectively). However, some projects with very small coverage (~4%) were found. Still, we observed that 85% of the studied projects have at least one broken build that take more than four days to be fixed. Interestingly, very small projects (up to 1,000 lines of code) are the ones that take the longest to fix broken builds. Finally, we noted that, for the majority of the studied projects, the build is executed under the 10 minutes rule of thumb. Conclusions: Our results are important to an increasing community of software engineers that employ CI practices on daily basis but may not be aware of bad practices that are eventually employed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper examines the prevalence of four unhealthy CI practices ('Continuous Integration Theater') across 1,270 open-source projects using TravisCI: infrequent commits (748 projects, ~60%), poor test coverage (data for 51 projects, average 78%), builds remaining broken for more than four days (85% of projects), and builds exceeding 10 minutes (minority of projects). It concludes these practices are common and warrant attention from the CI community.

Significance. If the prevalence estimates prove robust, the work supplies concrete observational counts from a sizable TravisCI sample that document gaps between CI tool adoption and effective usage. This could usefully inform practitioner guidelines and CI platform design. The explicit project counts and breakdown by language (e.g., Ruby vs. Java coverage) are strengths.

major comments (4)
  1. [Abstract / Results] Abstract and Results: the 60% infrequent-commits figure and the 85% long-broken-build figure rest on three un-derived cutoffs (commit frequency, four-day broken-build window, ten-minute build duration) with no sensitivity analysis or alternative thresholds reported; modest changes to any cutoff could shift the headline percentages substantially.
  2. [Results] Results (coverage paragraph): coverage data exist for only 51 of 1,270 projects; the reported averages and language comparisons therefore rest on a small, possibly non-representative subset and should be qualified accordingly.
  3. [Method] Method: the sample is drawn exclusively from TravisCI users, yet no discussion addresses whether this introduces selection bias toward more CI-aware projects, limiting claims about the broader population of CI users.
  4. [Results] Results: prevalence estimates are given as point values with no error bars, confidence intervals, or statistical tests; this weakens the quantitative claims even for the chosen thresholds.
minor comments (2)
  1. [Abstract] Abstract: the phrase '10 minutes rule of thumb' appears without prior definition; the main text should state the exact rule and its provenance.
  2. [Conclusions] Conclusions: the final paragraph could more explicitly restate the data limitations (small coverage subsample, TravisCI-only sample) alongside the prevalence numbers.
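The sensitivity analysis requested in major comment 1 can be sketched in a few lines: recompute the prevalence under several cutoffs and see how far the headline figure moves. Everything below is illustrative. The helper names and the per-project durations are invented for the example and are not data from the paper.

```python
def share_exceeding(values, threshold):
    """Fraction of projects whose value is strictly above `threshold`."""
    return sum(v > threshold for v in values) / len(values)


def sensitivity(values, thresholds):
    """Map each candidate threshold to the prevalence it would produce."""
    return {t: share_exceeding(values, t) for t in thresholds}


# Hypothetical longest-broken-build durations in days, one per project.
longest_broken_days = [0.5, 1, 2, 3, 4.5, 5, 6, 8, 12, 30]

for t, share in sensitivity(longest_broken_days, [2, 4, 7, 14]).items():
    print(f"> {t:>2} days: {share:.0%}")
```

A flat curve around the chosen cutoff would suggest the headline percentage is robust; a steep one would support the referee's concern that modest threshold changes shift the result.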

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important limitations in our presentation of results. We agree that the manuscript would benefit from additional analysis and qualifications. We will revise the paper to incorporate sensitivity analyses, better qualification of the coverage subsample, discussion of selection bias, and uncertainty measures for prevalence estimates.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: the 60% infrequent-commits figure and the 85% long-broken-build figure rest on three un-derived cutoffs (commit frequency, four-day broken-build window, ten-minute build duration) with no sensitivity analysis or alternative thresholds reported; modest changes to any cutoff could shift the headline percentages substantially.

    Authors: We acknowledge that the chosen thresholds (infrequent commits, four-day broken builds, and ten-minute builds) are presented without sensitivity analysis. While the ten-minute threshold is described as a 'rule of thumb' in the manuscript and the four-day window is motivated by prior work on build breakage, we agree that robustness should be demonstrated. In the revision we will add a sensitivity analysis subsection that varies each threshold and reports how the headline percentages change. revision: yes

  2. Referee: [Results] Results (coverage paragraph): coverage data exist for only 51 of 1,270 projects; the reported averages and language comparisons therefore rest on a small, possibly non-representative subset and should be qualified accordingly.

    Authors: The referee correctly notes the small sample (n=51) for coverage. We will revise the results and discussion sections to explicitly qualify this subsample as potentially non-representative, state the limitation prominently, and avoid over-generalizing the language-specific comparisons. revision: yes

  3. Referee: [Method] Method: the sample is drawn exclusively from TravisCI users, yet no discussion addresses whether this introduces selection bias toward more CI-aware projects, limiting claims about the broader population of CI users.

    Authors: We agree that restricting the sample to TravisCI projects may introduce selection bias. The revised manuscript will include an explicit limitations paragraph discussing this issue and its implications for generalizability beyond TravisCI users. revision: yes

  4. Referee: [Results] Results: prevalence estimates are given as point values with no error bars, confidence intervals, or statistical tests; this weakens the quantitative claims even for the chosen thresholds.

    Authors: We accept that point estimates alone are insufficient. The revision will add binomial confidence intervals for the main prevalence figures (60% and 85%) and, where feasible, for the coverage statistics. We will also note the absence of formal hypothesis tests as a limitation of the observational design. revision: yes
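As a worked example of the binomial confidence intervals mentioned in response 4, the sketch below computes a 95% Wilson score interval for the infrequent-commits figure (748 of 1,270 projects). The choice of the Wilson method and the function name are assumptions made here; the rebuttal does not say which interval would be used, and the calculation treats projects as independent draws.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts reported in the paper: 748 of 1,270 projects commit infrequently.
low, high = wilson_interval(748, 1270)
print(f"{748 / 1270:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The point estimate is 58.9%, which the paper rounds to ~60%, and under these assumptions the interval spans roughly 56% to 62%. The 85% figure could be treated the same way once the exact project count behind it is known.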

Circularity Check

0 steps flagged

No circularity; purely observational counts from explicit thresholds

Full rationale

The paper reports direct observational statistics (e.g., 748 projects with infrequent commits, 85% with broken builds >4 days) obtained by applying chosen cutoffs to the 1,270-project TravisCI dataset. No equations, fitted parameters, predictions, or derivations appear. No self-citations, uniqueness theorems, or ansatzes are invoked to support the central claims. The results are computed counts from the data under the stated definitions; they do not reduce to the inputs by construction. Threshold arbitrariness is a validity concern, not a circularity issue per the defined patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Findings rest on four ad-hoc thresholds (infrequent commits, poor coverage, >4 days broken, >10 min builds) and on the assumption that TravisCI metadata accurately captures practice quality.

free parameters (2)
  • broken-build duration threshold = 4 days
    Four days is used to mark 'long periods' without derivation from the data or external benchmark.
  • build duration threshold = 10 minutes
    Ten minutes is invoked as a 'rule of thumb' without justification or sensitivity analysis.
axioms (2)
  • domain assumption: TravisCI usage is a valid proxy for CI adoption and the selected projects represent broader CI practice.
    Method section selects only TravisCI projects.
  • domain assumption: Public coverage reports from 51 projects are sufficient to characterize test quality across the sample.
    Results report coverage only for this small subset.


