Recognition: 2 theorem links
How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing
Pith reviewed 2026-05-08 19:14 UTC · model grok-4.3
The pith
A 30-criteria checklist audit of 95 Java GitHub Actions workflows finds 28% overall compliance; a hybrid human-AI adjudication framework makes LLM-based checks scalable while cutting verification effort by 81%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a novel, documentation-grounded GHA compliance checklist with 30 criteria spanning four workflow sections and eight themes, and demonstrate on 95 real-world Java workflows that LLMs enable scalable compliance measurement but cannot replace experts. A multi-tier adjudication framework, in which GPT-5 resolves model conflicts before targeted manual review, reduces verification effort by 81% while retaining 87% agreement with expert judgment, and reveals major compliance gaps: 28% overall compliance and only 4% for permission controls.
What carries the argument
The multi-tier adjudication framework that routes LLM outputs through GPT-5 conflict resolution followed by selective human review against the 30-criteria checklist.
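The paper does not publish pseudocode for this routing, but the tiering it describes (accept unanimous LLM verdicts, send conflicts to GPT-5, escalate what remains to a human) can be sketched as follows. A minimal sketch under stated assumptions: the function names, the pass/fail verdict encoding, and the idea that the tiebreaker may abstain are all illustrative, not the authors' implementation.

```python
from collections import Counter

def adjudicate(votes, tiebreaker, human_review):
    """Route one (workflow, criterion) assessment through the tiers.

    votes: list of 'pass'/'fail' verdicts from the open-weight LLMs.
    tiebreaker: callable resolving a conflict (stand-in for GPT-5);
        returns a verdict, or None when it cannot decide.
    human_review: callable returning the expert verdict.
    """
    tally = Counter(votes)
    if len(tally) == 1:                  # Tier 1: unanimous LLM verdict
        return votes[0], "llm-unanimous"
    resolved = tiebreaker(votes)         # Tier 2: conflict resolution
    if resolved is not None:
        return resolved, "tiebreaker"
    return human_review(), "manual"      # Tier 3: targeted expert review
```

The effort saving then comes from how rarely the third branch fires: only assessments the tiebreaker cannot settle ever reach a human.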
If this is right
- Overall compliance stands at 28%, with permission controls at 4%; the Security theme (26%) trails far behind Clarity (68%).
- LLMs display only fair agreement (Fleiss' kappa = 0.28) and systematic disagreements on structural reasoning and security-sensitive judgments.
- Hybrid human-AI auditing supplies a practical route to defensible large-scale compliance measurement.
- Security and maintainability practices lag substantially behind clarity practices in real workflows.
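The fair-agreement figure (Fleiss' kappa = 0.28) is a standard multi-rater statistic; a minimal sketch of how it is computed from per-item category counts is below. The toy matrices in the test cases are illustrative, not the paper's 2,850-assessment data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts: list of lists; counts[i][j] = number of raters assigning
    item i to category j. Every item must have the same rater total.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from marginal category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

With four raters and binary pass/fail verdicts, each of the paper's 2,850 assessments would contribute one row of counts summing to 4.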
Where Pith is reading between the lines
- Embedding the checklist and framework directly into workflow editors could flag non-compliant patterns at creation time.
- The same checklist-driven hybrid method could extend to auditing other CI/CD platforms or infrastructure-as-code repositories.
- Widespread low compliance points to a need for clearer official guidance or built-in tooling that enforces the documented practices.
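As an illustration of how checklist criteria could run inside a workflow editor, here is a minimal linter sketch over raw workflow text. The three rules echo themes the abstract names (permission controls, secrets management, maintainability), but their wording and regexes are assumptions for illustration; the paper's 30 criteria are not reproduced here.

```python
import re

# Illustrative checklist-style criteria; these rules and their regexes
# are assumptions, not the paper's actual 30 criteria.
CHECKS = [
    ("explicit-permissions",
     "Workflow should declare a top-level permissions: block",
     lambda src: re.search(r"^permissions:", src, re.MULTILINE) is not None),
    ("no-inline-secrets",
     "Tokens should come from the secrets context, never literals",
     lambda src: not re.search(r"(?i)(api[_-]?key|token)\s*:\s*['\"]\w", src)),
    ("pinned-action-versions",
     "Third-party actions should be pinned to a version or SHA",
     lambda src: all("@" in u for u in re.findall(r"uses:\s*(\S+)", src))),
]

def audit(source):
    """Return the ids of criteria the workflow text fails."""
    return [cid for cid, _desc, ok in CHECKS if not ok(source)]
```

An editor integration would run `audit` on save and surface the failing criterion descriptions inline, flagging non-compliant patterns at creation time.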
Load-bearing premise
The 30-criteria checklist accurately and comprehensively captures documented best practices for security, maintainability, and performance in GitHub Actions workflows.
What would settle it
A complete independent manual audit of the same 95 workflows using the identical 30-criteria checklist, then direct comparison of the resulting compliance percentages and adjudication effort savings against the reported 28% and 81% figures.
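As a back-of-envelope check on what the 81% figure implies: assuming verification effort scales with the number of assessments a human must still inspect (a reading of the metric that is an assumption here, not the paper's definition), the residual manual workload follows directly.

```python
TOTAL_ASSESSMENTS = 2850      # 95 workflows x 30 criteria
EFFORT_REDUCTION = 0.81       # reported verification-effort saving

# Under this reading, the escalated share is simply the complement
# of the reported saving.
manual_share = 1 - EFFORT_REDUCTION
manual_items = round(TOTAL_ASSESSMENTS * manual_share)   # roughly 540 assessments
```

An independent replication audit would need to cover all 2,850 assessments, so it would directly test whether targeted review of only this residual share preserves the reported 87% agreement.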
Original abstract
GitHub Actions (GHA) CI workflows are critical infrastructure, but current tooling offers only syntactic or heuristic checks and does not enforce documented best practices for security, maintainability, or performance. Consequently, issues like over-privileged permissions, weak secrets management, and missing failure notifications remain undetected in real-world pipelines. This paper proposes a novel, documentation-grounded GHA compliance checklist with 30 criteria spanning four workflow sections and eight themes, and assesses Large Language Models (LLMs) for scalable compliance auditing. On 95 real-world Java workflows (2,850 assessments) using four open-weight LLMs, we find only fair agreement (Fleiss' kappa = 0.28), with systematic disagreement on structural reasoning and security-sensitive judgments. To address this, we introduce a multi-tier adjudication framework in which GPT-5 resolves model conflicts before targeted manual review, reducing verification effort by 81% while retaining 87% agreement with expert judgment. At scale, it reveals major compliance gaps: overall compliance is 28%, dropping to 4% for permission controls; Security (26%) lags far behind Clarity (68%). Our results show that LLMs enable scalable compliance measurement but cannot replace experts, highlighting the need for hybrid human-AI auditing and providing empirical benchmarks and guidance for defensible GHA workflow audits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a documentation-grounded 30-criteria checklist for GitHub Actions (GHA) workflows covering security, maintainability, and performance across four sections and eight themes. It evaluates four open-weight LLMs via 2,850 assessments on 95 real-world Java workflows, reports only fair inter-LLM agreement (Fleiss' kappa = 0.28) with systematic disagreements on structural and security judgments, and proposes a multi-tier adjudication framework (GPT-5 resolving conflicts before targeted manual review) that reduces verification effort by 81% while retaining 87% agreement with expert judgment. The study finds low compliance (28% overall, 4% for permissions, 26% for Security vs. 68% for Clarity) and concludes that LLMs enable scalable auditing but require hybrid human oversight.
Significance. If the checklist and sampling are robust, this provides concrete empirical benchmarks on real-world GHA compliance gaps and a practical, quantified hybrid auditing method that could inform both practitioners and future tooling. The work is strengthened by its use of a substantial assessment volume (2,850), reporting of Fleiss' kappa, and explicit effort-reduction and agreement metrics for the adjudication framework.
Major comments (2)
- §3 (Checklist Construction) and Abstract: The headline compliance rates (28% overall, 4% permissions, Security 26% vs. Clarity 68%) and the LLM-vs-expert comparison both presuppose that the 30-criteria checklist is a faithful, complete proxy for documented best practices. The manuscript describes it as 'documentation-grounded' but provides no explicit derivation process, source-to-criterion mapping, coverage audit against GitHub docs/advisories, or external validation (e.g., expert consensus). This is load-bearing for the 'major gaps' claim and risks systematic omissions or debatable thresholds inflating the reported non-compliance.
- §4 (Workflow Sampling and Dataset): The selection criteria and representativeness of the 95 Java workflows are not fully specified (e.g., filtering rules, repository popularity thresholds, or exclusion of toy/example repos). This directly affects the generalizability of the compliance statistics and the claim that the findings reveal 'major compliance gaps' in real-world pipelines.
Minor comments (3)
- Table 2 or equivalent (per-criterion results): Consider adding a column or footnote showing the exact wording of each criterion alongside its compliance percentage to improve traceability.
- Figure 1 (adjudication framework): The diagram and caption could more explicitly label the three tiers and the exact conditions for escalating to manual review.
- §5 (Threats to Validity): The discussion of LLM prompt sensitivity and criterion interpretation variability could be expanded with a brief sensitivity analysis or example prompts.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with specific plans for revision where the concerns are valid, while defending the core methodology on substantive grounds.
Point-by-point responses
-
Referee: §3 (Checklist Construction) and Abstract: The headline compliance rates (28% overall, 4% permissions, Security 26% vs. Clarity 68%) and the LLM-vs-expert comparison both presuppose that the 30-criteria checklist is a faithful, complete proxy for documented best practices. The manuscript describes it as 'documentation-grounded' but provides no explicit derivation process, source-to-criterion mapping, coverage audit against GitHub docs/advisories, or external validation (e.g., expert consensus). This is load-bearing for the 'major gaps' claim and risks systematic omissions or debatable thresholds inflating the reported non-compliance.
Authors: We agree that greater transparency in the checklist derivation would strengthen the paper. The 30 criteria were systematically extracted from GitHub's official documentation on workflow syntax, security best practices, and advisories (e.g., permissions, secrets, and job configuration sections), cross-referenced against community guidelines and prior literature on CI security. However, the original submission did not include an explicit mapping table or coverage audit. In the revised manuscript we will add a dedicated subsection in §3 with a complete source-to-criterion mapping, references to specific GitHub documentation URLs, and an explicit discussion of coverage and any threshold decisions. We will also note the absence of a formal expert consensus round as a limitation. These additions directly address the load-bearing concern without changing the reported compliance figures. revision: yes
-
Referee: §4 (Workflow Sampling and Dataset): The selection criteria and representativeness of the 95 Java workflows are not fully specified (e.g., filtering rules, repository popularity thresholds, or exclusion of toy/example repos). This directly affects the generalizability of the compliance statistics and the claim that the findings reveal 'major compliance gaps' in real-world pipelines.
Authors: We acknowledge that the sampling description in §4 could be more explicit to support reproducibility and generalizability claims. The 95 workflows were drawn from actively maintained open-source Java repositories on GitHub that contain at least one .github/workflows file, with selection prioritizing repositories having non-trivial activity (measured by recent commits and stars) while excluding forks, archived repositories, and purely tutorial/example projects. To improve clarity, the revised §4 will enumerate the precise filtering rules, any popularity thresholds applied, exclusion criteria, and the rationale for restricting to Java projects (to control for language-specific workflow patterns). This expansion will better substantiate the claim of revealing gaps in real-world pipelines. revision: yes
Circularity Check
No circularity: direct empirical measurement on external workflows
Full rationale
The paper performs an empirical audit: it defines a 30-criteria checklist, applies LLMs to 95 real-world Java workflows (2,850 assessments), measures agreement via Fleiss' kappa, and evaluates a multi-tier adjudication framework against expert judgment. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the abstract or described methodology. Compliance rates (28% overall, 4% permissions) and effort-reduction figures (81%) are direct counts and comparisons on external data, not reductions by construction. The checklist is presented as documentation-grounded without any claim that its validity is proven by the compliance results themselves. This is a standard measurement study whose central claims rest on external benchmarks rather than self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The 30 criteria accurately reflect documented best practices for GitHub Actions workflows in security, maintainability, and performance.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024).
- [2] Edward Abrokwah and Taher A. Ghaleb. 2026. Auditing GitHub Actions Workflows: A Compliance Checklist and Evaluation Using LLMs (Replication Package). https://github.com/Taher-Ghaleb/GHACompliance-EASE2026.
- [3] Alireza Amiri-Margavi, Iman Jebellat, Ehsan Jebellat, and Seyed Pouyan Mousavi Davoudi. 2025. Enhancing answer reliability through inter-model consensus of large language models. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, 299–316.
- [4] Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. TravisTorrent: Synthesizing Travis CI and GitHub for full-stack research on continuous integration. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories. IEEE, 447–450.
- [6] Łukasz Chomątek, Jakub Papuga, Przemyslaw Nowak, and Aneta Poniszewska-Marańda. 2025. Decoding CI/CD Practices in Open-Source Projects with LLM Insights. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 1638–1644.
- [7] Nitika Chopra and Taher A. Ghaleb. 2025. From First Use to Final Commit: Studying the Evolution of Multi-CI Service Adoption. In International Conference on Software Maintenance and Evolution. IEEE, 773–778.
- [8] Codecov. 2021. Codecov Bash Uploader Security Incident. https://about.codecov.io/security-update. Accessed: 2026-01-20.
- [9] Cycode Security Research. 2024. How we found vulnerabilities in GitHub Actions CI/CD pipelines. https://cycode.com/blog/github-actions-vulnerabilities. Accessed: 2026-01-20.
- [10] Martin Fowler. [n. d.]. Continuous Integration. https://martinfowler.com/articles/originalContinuousIntegration.html.
- [11] Keheliya Gallaba and Shane McIntosh. 2018. Use and misuse of continuous integration features: An empirical study of projects that (mis)use Travis CI. IEEE Transactions on Software Engineering 46, 1 (2018), 33–50.
- [12] Taher A. Ghaleb. 2026. When AI Agents Touch CI/CD Configurations: Frequency and Success. In Proceedings of the 23rd International Conference on Mining Software Repositories. ACM, 1–5.
- [13] Taher A. Ghaleb, Osamah Abduljalil, and Safwat Hassan. 2026. CI/CD Configuration Practices in Open Source Android Apps: An Empirical Study. ACM Transactions on Software Engineering and Methodology 35, 2 (2026), 1–40.
- [14] Taher Ahmed Ghaleb, Daniel Alencar da Costa, and Ying Zou. 2019. An empirical study of the long duration of continuous integration builds. Empirical Software Engineering 24, 4 (2019), 2102–2139.
- [15] Taher Ahmed Ghaleb, Daniel Alencar da Costa, Ying Zou, and Ahmed E. Hassan. 2019. Studying the Impact of Noises in Build Breakage Data. IEEE Transactions on Software Engineering (2019), 1–14. doi:10.1109/TSE.2019.2941880.
- [17] Taher A. Ghaleb, Safwat Hassan, and Ying Zou. 2022. Studying the interplay between the durations and breakages of continuous integration builds. IEEE Transactions on Software Engineering 49, 4 (2022), 2476–2497.
- [18] Taher A. Ghaleb and Dulina Rathnayake. 2025. Can LLMs Write CI? A Study on Automatic Generation of GitHub Actions Configurations. In 2025 IEEE International Conference on Software Maintenance and Evolution. IEEE, 767–772.
- [19] GitHub. [n. d.]. GitHub Actions Documentation. https://docs.github.com/en/actions. Accessed: 2025-11-17.
- [22] Michael Hilton, Nicholas Nelson, Timothy Tunnell, Darko Marinov, and Danny Dig. 2017. Trade-offs in continuous integration: assurance, security, and flexibility. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering. 197–207.
- [24] IN-COM DATA SYSTEMS. 2025. What Is the Difference Between Static Code Analysis and Linting? https://www.in-com.com/blog/what-is-the-difference-between-static-code-analysis-and-linting. Accessed: 2026-01-20.
- [25] Ali Khatami, Cédric Willekens, and Andy Zaidman. 2024. Catching smells in the act: A GitHub Actions workflow investigation. In 2024 IEEE International Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 47–58.
- [26] Legit Security Research Team. 2024. Vulnerable GitHub Actions Workflows Part 1: Privilege Escalation Inside Your CI/CD Pipeline. https://www.legitsecurity.com/blog/github-privilege-escalation-vulnerability. Accessed: 2026-01-20.
- [27] Yiqi Liu, Nafise Sadat Moosavi, and Chenghua Lin. 2024. LLMs as narcissistic evaluators: When ego inflates evaluation scores. In Findings of the Association for Computational Linguistics: ACL 2024. 12688–12701.
- [28] Francisco S. Marcondes, Adelino Gala, Renata Magalhães, Fernando Perez de Britto, Dalila Durães, and Paulo Novais. 2025. Using Ollama. In Natural Language Analytics with Generative Large-Language Models: A Practical Approach with Ollama and Open-Source LLMs. Springer, 23–35.
- [30] Mistral AI. 2024. Mistral 7B. https://mistral.ai/news/mistral-7b/. Accessed: 2025-10-18.
- [31] Oligo Security. 2025. Static Code Analysis: Top 7 Methods, Pros/Cons and Best Practices. https://www.oligo.security/academy/static-code-analysis. Accessed: 2026-01-20.
- [32] OpenAI. 2025. Introducing GPT-5. https://openai.com/gpt-5. Accessed: 2025-08-27.
- [33] OpenAI. 2026. Reasoning Models Guide. https://developers.openai.com/api/docs/guides/reasoning. Accessed: 2026.
- [35] Enio G. Santana Jr., Jander Pereira Santos Junior, Erlon P. Almeida, Iftekhar Ahmed, Paulo Anselmo da Mota Silveira Neto, and Eduardo Santana de Almeida. 2025. Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study. arXiv preprint arXiv:2506.07594 (2025).
- [36] Klaas-Jan Stol, Paul Ralph, and Brian Fitzgerald. 2016. Grounded theory in software engineering research: a critical review and guidelines. In Proceedings of the 38th International Conference on Software Engineering. 120–131.
- [38] Davide Taibi, Andrea Janes, and Valentina Lenarduzzi. 2017. How developers perceive smells in source code: A replicated study. Information and Software Technology 92 (2017), 223–235.
- [39] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025).
- [40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- [41] Pablo Valenzuela-Toledo, Alexandre Bergel, Timo Kehrer, and Oscar Nierstrasz. 2024. The Hidden Costs of Automation: An Empirical Study on GitHub Actions Workflow Maintenance. In 2024 IEEE International Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 213–223.
- [43] Carmine Vassallo, Sebastian Proksch, Anna Jancso, Harald C. Gall, and Massimiliano Di Penta. 2020. Configuration smells in continuous delivery pipelines: a linter and a six-month study on GitLab. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 327–337.
- [44] Stefan Wagner, Marvin Muñoz Barón, Davide Falessi, and Sebastian Baltes. 2025. Towards evaluation guidelines for empirical studies involving LLMs. In 2025 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE). IEEE, 24–27.
- [45] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. 2024. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 9440–9450.
- [46] Weiyuan Xu, Juntao Luo, Tao Huang, Kaixin Sui, Jie Geng, Qijun Ma, Isami Akasaka, Xiaoxue Shi, Jing Tang, and Peng Cai. 2025. LogSage: An LLM-Based Framework for CI/CD Failure Detection and Remediation with Industrial Validation. In 40th IEEE/ACM International Conference on Automated Software Engineering.
- [47] Fiorella Zampetti, Carmine Vassallo, Sebastiano Panichella, Gerardo Canfora, Harald Gall, and Massimiliano Di Penta. 2020. An empirical characterization of bad practices in continuous integration. Empirical Software Engineering 25, 2 (2020), 1095–1135.
- [48] Chen Zhang, Bihuan Chen, Junhao Hu, Xin Peng, and Wenyun Zhao. 2022. BuildSonic: Detecting and repairing performance-related configuration smells for continuous integration builds. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–13.
- [49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623.
- [50] Xiaoxin Zhou, Taher A. Ghaleb, and Safwat Hassan. 2026. Role of CI Adoption in Mobile App Success: An Empirical Study of Open-Source Android Projects. In Proceedings of the 23rd International Conference on Mining Software Repositories. ACM, 1–12.