arxiv: 2604.17662 · v1 · submitted 2026-04-19 · 💻 cs.SE

Beyond the YAML File: Understanding Real-World GitHub Actions Workflow Adoption

Pith reviewed 2026-05-10 05:05 UTC · model grok-4.3

classification 💻 cs.SE
keywords GitHub Actions · workflow adoption · CI/CD · failure patterns · empirical study · software repositories · automation usage · configuration gap

The pith

Real-world GitHub Actions data reveals three distinct developer responses to workflow failures along with a gap between configuration and actual use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how GitHub Actions are adopted and used in practice by looking at actual run records instead of just configuration files. It quantitatively processes over 258,000 workflow executions from 952 repositories and qualitatively examines 21 diverse projects to see how people respond to failures and how workflows fit into project work. The study finds three clear patterns in how teams deal with failed workflows, a link between more frequent use and fewer failures, and cases where workflow files are present but the workflows are turned off or ignored. These insights help explain why automation sometimes falls short in real projects and point to areas where better support could make CI/CD more effective.

Core claim

We identify three distinct failure response patterns, observe that higher usage intensity of GHA workflows correlates with lower failure rates, and uncover a configuration-usage gap where the presence of configuration files masks disabled or unused workflows. Moreover, our qualitative analysis of relationships between project characteristics and utilization patterns yields five hypotheses for future validation.
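
The configuration-usage gap can be illustrated with a small sketch. It assumes each workflow is summarized by a hypothetical record holding a `state` string and a `run_count`; the field names, the `"active"` state value, and the "at least one run" rule are illustrative assumptions, not the paper's operationalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowRecord:
    # Hypothetical summary of one workflow file; not the paper's schema.
    repo: str
    path: str
    state: str      # assumed: "active" means enabled, anything else means disabled
    run_count: int  # runs observed in the collection window


def configuration_usage_gap(records):
    """Return repositories that have workflow files but no workflow in use.

    A workflow counts as "in use" only if it is active AND has at least one run.
    """
    used_by_repo = {}
    for rec in records:
        in_use = rec.state == "active" and rec.run_count > 0
        used_by_repo[rec.repo] = used_by_repo.get(rec.repo, False) or in_use
    return sorted(repo for repo, used in used_by_repo.items() if not used)


if __name__ == "__main__":
    sample = [
        WorkflowRecord("org/a", ".github/workflows/ci.yml", "active", 120),
        WorkflowRecord("org/b", ".github/workflows/ci.yml", "disabled", 40),
        WorkflowRecord("org/c", ".github/workflows/ci.yml", "active", 0),
    ]
    print(configuration_usage_gap(sample))  # ['org/b', 'org/c']
```

A check that only looks for the presence of a workflow file would count all three sample repositories as GHA adopters; the sketch flags the two whose configuration is disabled or never executed.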

What carries the argument

Mixed-methods analysis of 258,300 workflow run records combined with in-depth review of 21 repositories to map failure responses and usage patterns.

Load-bearing premise

The chosen set of 952 repositories for quantitative data and 21 for qualitative analysis accurately reflects how GitHub Actions are used more broadly.

What would settle it

Finding a different number of failure response patterns or no correlation between usage intensity and failure rates in a larger random sample of repositories would challenge the main findings.

Figures

Figures reproduced from arXiv: 2604.17662 by Ali Khatami, Andy Zaidman, Carolin Brandt.

Figure 1: Methodology Overview. … repositories using GHA, and (2) ensuring these repositories have accessible execution history. To address these challenges, we begin with an existing dataset while recognizing that GitHub’s workflow run retention policy would require us to collect fresh data. We use the dataset by Bouzenia and Pradel [8], which contains 952 GitHub repositories with workflow run histories. This dataset…
Figure 2: Distribution of sampled repositories (red dots, n=21) across the population. Our sample spans the complete range…
Figure 3: Repositories Run Count vs. Run Failure Rate. Points…
Figure 4: Trigger event diversity increasing as the # of runs…
Figure 5: Developer Response Patterns to Workflow Failures in PR and Main Branch Contexts
Original abstract

Continuous Integration and Continuous Deployment (CI/CD) have become fundamental to modern software development, with GitHub Actions (GHA) emerging as a dominant automation platform. In this study, we analyze real-world execution records of GHA, examining how developers react to workflow failures, how these workflows are utilized by projects, and how these aspects relate to project characteristics. We quantitatively analyze 258,300 workflow run records from 952 repositories and perform an in-depth qualitative analysis of 21 selected, diverse GitHub repositories to understand how maintainers and contributors interact with workflow results. We identify three distinct failure response patterns, observe that higher usage intensity of GHA workflows correlates with lower failure rates, and uncover a configuration-usage gap where the presence of configuration files masks disabled or unused workflows. Moreover, our qualitative analysis of relationships between project characteristics and utilization patterns yields five hypotheses for future validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents an empirical study of real-world GitHub Actions (GHA) adoption. It quantitatively analyzes 258,300 workflow run records from 952 repositories to examine failure responses, usage intensity, and failure rates, and qualitatively studies 21 selected repositories to identify patterns in how maintainers interact with workflow results. The central claims are the identification of three distinct failure response patterns, a negative correlation between GHA usage intensity and failure rates, a configuration-usage gap where configuration files mask disabled or unused workflows, and five hypotheses relating project characteristics to utilization patterns.

Significance. If the findings hold after addressing sampling and analysis details, the work offers valuable large-scale observational data on CI/CD practices with GHA, a dominant platform. The scale of the quantitative dataset (258k runs) is a strength for identifying usage patterns and correlations, and the mixed-methods approach yields actionable hypotheses. This could inform tool builders and practitioners on workflow design, though the observational design inherently limits causal claims.

major comments (3)
  1. [§3] §3 (Data Collection and Sampling): The criteria and process for selecting the 952 repositories (and the 21 for qualitative analysis) are not described in sufficient detail to evaluate representativeness or rule out selection bias. This is load-bearing for the correlation between usage intensity and failure rates (reported in §4) and the three failure patterns, as unmeasured factors like repository popularity, age, or language could confound results.
  2. [§4.2] §4.2 (Quantitative Results on Usage and Failures): The negative correlation is reported without statistical controls for potential confounders (e.g., project size, team activity, primary language). Without these or sensitivity analyses, the lower failure rates cannot be confidently attributed to usage intensity rather than to external variables.
  3. [§5] §5 (Qualitative Analysis): The failure classification criteria, inter-rater reliability measures, and exact selection process for the 21 repositories are not specified. This undermines the validity of the three identified failure response patterns and the configuration-usage gap, as measurement bias or non-representative cases could produce the observed patterns.
minor comments (2)
  1. [Abstract] The abstract and introduction could more clearly distinguish between the quantitative sample (952 repos) and qualitative subsample (21 repos) to avoid reader confusion about scope.
  2. [Figures/Tables] Figure captions and table descriptions would benefit from explicit definitions of 'usage intensity' and 'failure rate' to improve reproducibility.
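
To make the request for explicit definitions concrete, here is a minimal sketch of one possible operationalization. It assumes "usage intensity" is the number of runs per repository and "failure rate" is the share of runs whose conclusion is `"failure"`; both definitions, the `(repo, conclusion)` input shape, and the choice of a Spearman rank correlation are assumptions for illustration, not the paper's stated method.

```python
from collections import defaultdict


def per_repo_metrics(runs):
    """Map repo -> (run_count, failure_rate) from (repo, conclusion) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for repo, conclusion in runs:
        totals[repo] += 1
        if conclusion == "failure":
            failures[repo] += 1
    return {repo: (n, failures[repo] / n) for repo, n in totals.items()}


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

With per-repository metrics in hand, `spearman(run_counts, failure_rates)` gives a single association measure; a negative value would match the direction of the reported finding, but it says nothing about confounders.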

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency and robustness of our empirical study. We address each major comment point by point below, outlining the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Data Collection and Sampling): The criteria and process for selecting the 952 repositories (and the 21 for qualitative analysis) are not described in sufficient detail to evaluate representativeness or rule out selection bias. This is load-bearing for the correlation between usage intensity and failure rates (reported in §4) and the three failure patterns, as unmeasured factors like repository popularity, age, or language could confound results.

    Authors: We agree that greater detail on the sampling process is required to allow evaluation of representativeness and potential biases. In the revised manuscript, we will expand §3 with a full description of the repository selection criteria, including the source population (e.g., GitHub repositories with public workflow histories), inclusion filters such as minimum workflow run counts and activity thresholds, and steps taken to promote diversity in programming languages and project sizes. For the 21 repositories in the qualitative analysis, we will document the purposive sampling approach used to capture variation in failure response patterns. We will also add an explicit limitations subsection addressing selection bias and generalizability. revision: yes

  2. Referee: [§4.2] §4.2 (Quantitative Results on Usage and Failures): The negative correlation is reported without statistical controls for potential confounders (e.g., project size, team activity, primary language). Without these or sensitivity analyses, the lower failure rates cannot be confidently attributed to usage intensity rather than to external variables.

    Authors: We concur that controlling for confounders strengthens causal interpretation in observational data. In the revision, we will augment the analysis in §4.2 with multivariate regression models that include controls for project size (measured by stars and contributors), team activity (commit frequency), primary language, and repository age. We will also report sensitivity analyses, such as stratified correlations and alternative model specifications, to assess the stability of the negative association between usage intensity and failure rates. These additions will be presented alongside the existing descriptive results while maintaining the observational framing of the study. revision: yes

  3. Referee: [§5] §5 (Qualitative Analysis): The failure classification criteria, inter-rater reliability measures, and exact selection process for the 21 repositories are not specified. This undermines the validity of the three identified failure response patterns and the configuration-usage gap, as measurement bias or non-representative cases could produce the observed patterns.

    Authors: We recognize the need for explicit methodological transparency in the qualitative component. In the revised §5, we will specify the failure classification criteria in detail, including the coding scheme, category definitions, and illustrative examples from the data. We will report inter-rater reliability statistics (e.g., Cohen's kappa) from the independent coding performed by the research team. Additionally, we will describe the exact selection process for the 21 repositories, including how cases were chosen to reflect diversity in failure patterns and project characteristics. These clarifications will support readers' assessment of the identified patterns and the configuration-usage gap. revision: yes
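
The inter-rater reliability statistic mentioned in the response, Cohen's kappa, can be computed as in the following minimal sketch. The label names are placeholders (the three response patterns are not named in this summary), and the two-coder setup is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each coder's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Placeholder labels, not the paper's pattern names.
    coder_1 = ["pattern_a", "pattern_a", "pattern_b", "pattern_c"]
    coder_2 = ["pattern_a", "pattern_b", "pattern_b", "pattern_c"]
    print(round(cohens_kappa(coder_1, coder_2), 4))  # 0.6364
```

In the example the coders agree on 3 of 4 items (p_o = 0.75) while chance agreement is 5/16, giving kappa = 7/11.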

Circularity Check

0 steps flagged

No circularity: purely observational empirical study

Full rationale

The paper performs quantitative analysis of 258,300 external GitHub workflow runs from 952 repositories plus qualitative coding of 21 cases. It reports observed failure-response patterns, a usage-intensity correlation, a configuration-usage gap, and five hypotheses. No equations, fitted parameters, predictions, derivations, or self-citations appear in the provided text or abstract; all claims are direct summaries of collected external data with no reduction to internal definitions or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on assumptions about data representativeness and the validity of qualitative pattern extraction; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: The 952 repositories provide a representative sample of GitHub Actions usage across open-source projects.
    The selection process and potential biases are not described in the abstract.
  • domain assumption: Failure responses observed in the 21 qualitative repositories can be generalized into three distinct patterns.
    This depends on the diversity and coding reliability of the selected cases.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1] Jessy Ayala and Joshua Garcia. 2023. An empirical study on workflows and security policies in popular github repositories. In 2023 IEEE/ACM 1st International Workshop on Software Vulnerability (SVM). IEEE, 6–9
  2. [2] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In 2013 35th International Conference on Software Engineering (ICSE), 712–721
  3. [3] Moritz Beller, Radjino Bholanath, Shane McIntosh, and Andy Zaidman. 2016. Analyzing the state of static analysis: A large-scale evaluation in open source software. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 470–481
  4. [4] Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. Oops, my tests broke the build: an explorative analysis of travis CI with github. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR). IEEE, 356–367
  5. [5] Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. Travistorrent: synthesizing travis CI and github for full-stack research on continuous integration. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR). IEEE, 447–450
  6. [6] Giacomo Benedetti, Luca Verderame, and Alessio Merlo. 2022. Automatic security assessment of GitHub Actions workflows. In Proceedings of the 2022 ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, 37–45
  7. [7] Al Bessey et al. 2010. A few billion lines of code later: using static analysis to find bugs in the real world. Commun. ACM, 53, 2, 66–75
  8. [8] Islem Bouzenia and Michael Pradel. 2024. Resource usage and optimization opportunities in workflows of GitHub Actions. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE). ACM, 25:1–25:12
  9. [9] Tingting Chen, Yang Zhang, Shu Chen, Tao Wang, and Yiwen Wu. 2021. Let’s supercharge the workflows: an empirical study of GitHub Actions. In 2021 IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE, 01–10
  10. [10] Jacob Cohen, Patricia Cohen, Stephen G West, and Leona S Aiken. 2013. Applied multiple regression/correlation analysis for the behavioral sciences. Routledge
  11. [11] Alexandre Decan, Tom Mens, and Hassan Onsori Delicheh. 2023. On the outdatedness of workflows in the GitHub Actions ecosystem. Journal of Systems and Software, 206, 111827
  12. [12] Alexandre Decan, Tom Mens, Pooya Rostami Mazrae, and Mehdi Golzadeh
  13. [13] On the use of GitHub Actions in software development repositories. In IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 235–245
  14. [14] Frederik Michel Dekking, Cornelis Kraaikamp, Hendrik Paul Lopuhaä, and Ludolf Erwin Meester. 2005. A Modern Introduction to Probability and Statistics: Understanding why and how. Vol. 488. Springer
  15. [15] Hassan Onsori Delicheh, Alexandre Decan, and Tom Mens. 2023. A preliminary study of GitHub Actions dependencies. In SATToSE, 66–77
  16. [16] Hassan Onsori Delicheh and Tom Mens. 2024. Mitigating security issues in GitHub Actions. In Proceedings of the 2024 ACM/IEEE 4th International Workshop on Engineering and Cybersecurity of Critical Systems (EnCyCriS) and 2024 IEEE/ACM Second International Workshop on Software Vulnerability, 6–11
  17. [17] Omar Elazhary, Margaret-Anne D. Storey, Neil A. Ernst, and Andy Zaidman
  18. [18] Do as I do, not as I say: do contribution guidelines match the GitHub contribution process? In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 286–290
  19. [19] Omar Elazhary, Colin Werner, Ze Shi Li, Derek Lowlind, Neil A. Ernst, and Margaret-Anne Storey. 2022. Uncovering the benefits and challenges of continuous integration practices. IEEE Transactions on Software Engineering, 48, 7, 2570–2583. doi:10.1109/TSE.2021.3064953
  20. [20] M. Fowler and M. Foemmel. [n. d.] Continuous integration. [Online; accessed 29-May-2025]. https://tinyurl.com/ycbl2uhj
  21. [21] D. Randy Garrison, Martha Cleveland-Innes, Marguerite Koole, and James Kappelman. 2006. Revisiting methodological issues in transcript analysis: negotiated coding and reliability. Internet High. Educ., 9, 1, 1–8
  22. [22] Barney Glaser and Anselm Strauss. 2017. Discovery of grounded theory: Strategies for qualitative research. Routledge
  23. [23] Mehdi Golzadeh, Alexandre Decan, and Tom Mens. 2022. On the rise and fall of CI services in GitHub. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 662–672
  24. [24] Georgios Gousios and Andy Zaidman. 2014. A dataset for pull-based development research. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, 368–371
  25. [25] Georgios Gousios, Andy Zaidman, Margaret-Anne D. Storey, and Arie van Deursen. 2015. Work practices and challenges in pull-based development: the integrator’s perspective. In 37th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 358–368
  26. [26] Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. 2016. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM, 426–437
  27. [27] Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge
  28. [28] Why don’t software developers use static analysis tools to find bugs? In International Conference on Software Engineering (ICSE). IEEE, 672–681
  29. [29] Ali Khatami, Carolin Brandt, and Andy Zaidman. 2026. Replication package for “Beyond the YAML File: Understanding Real-World Github Actions Workflow Adoption”. (2026). doi:10.5281/zenodo.18258226
  30. [30] Ali Khatami, Carolin Brandt, and Andy Zaidman. 2024. Software quality assurance analytics: enabling software engineers to reflect on QA practices. In 2024 IEEE International Conference on Source Code Analysis and Manipulation (SCAM), 189–200
  31. [31] Ali Khatami, Cédric Willekens, and Andy Zaidman. 2024. Catching smells in the act: A github actions workflow investigation. In International Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 47–58
  32. [32] Ali Khatami and Andy Zaidman. 2024. State-of-the-practice in quality assurance in Java-based open source software development. Software: Practice and Experience, 54, 8, 1408–1446
  33. [33] Timothy Kinsman, Mairieli Wessel, Marco A Gerosa, and Christoph Treude
  34. [34] How do software developers use github actions to automate their workflows? In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 420–431
  35. [35] Eriks Klotins, Tony Gorschek, Katarina Sundelin, and Erik Falk. 2022. Towards cost-benefit evaluation for continuous software engineering activities. Empirical Software Engineering, 157, 6
  36. [36] Igibek Koishybayev, Aleksandr Nahapetyan, Raima Zachariah, Siddharth Muralee, Bradley Reaves, Alexandros Kapravelos, and Aravind Machiry. 2022. Characterizing the security of GitHub CI workflows. In 31st USENIX Security Symposium (USENIX Security 22), 2747–2763
  37. [37] Zhixing Li, Yue Yu, Tao Wang, Shanshan Li, and Huaimin Wang. 2022. Opportunities and challenges in repeated revisions to pull-requests: an empirical study. Proc. ACM Hum.-Comput. Interact., 6, CSCW2, Article 317, (Nov. 2022), 35 pages
  38. [38] Pooya Rostami Mazrae, Alexandre Decan, and Tom Mens. 2024. Gawd: a differencing tool for github actions workflows. In Proceedings of the 21st International Conference on Mining Software Repositories, 682–686
  39. [39] Pooya Rostami Mazrae, Tom Mens, Mehdi Golzadeh, and Alexandre Decan
  40. [40] On the usage, co-usage and migration of CI/CD tools: a qualitative analysis. Empirical Software Engineering, 28, 2, 52
  41. [41] Jadson Santos, Daniel Alencar da Costa, Shane McIntosh, and Uirá Kulesza
  42. [42] On the need to monitor continuous integration practices. Empirical Software Engineering, 30, 5, (June 2025), 47 pages
  43. [43] Sk Golam Saroar and Maleknaz Nayebi. 2023. Developers’ perception of GitHub Actions: a survey analysis. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. ACM, 121–130
  44. [44] Pablo Valenzuela-Toledo and Alexandre Bergel. 2022. Evolution of github action workflows. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 123–127
  45. [45] Erik van der Veen, Georgios Gousios, and Andy Zaidman. 2015. Automatically prioritizing pull requests. In 12th IEEE/ACM Working Conference on Mining Software Repositories (MSR). IEEE, 357–361
  46. [46] Bogdan Vasilescu, Yue Yu, Huaimin Wang, Premkumar Devanbu, and Vladimir Filkov. 2015. Quality and productivity outcomes relating to continuous integration in GitHub. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). ACM, 805–816. ISBN: 9781450336758
  47. [47] Mairieli Wessel, Joseph Vargovich, Marco Aurélio Gerosa, and Christoph Treude. 2023. Github actions: the impact on the pull request process. Empir. Softw. Eng., 28, 6, 131
  48. [48] Yang Zhang, Yiwen Wu, Tingting Chen, Tao Wang, Hui Liu, and Huaimin Wang. 2024. How do developers talk about GitHub Actions? Evidence from online software development community. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE). ACM