pith. machine review for the scientific record.

arxiv: 2605.02091 · v1 · submitted 2026-05-03 · 💻 cs.SE

Recognition: 2 theorem links

How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:14 UTC · model grok-4.3

classification 💻 cs.SE
keywords GitHub Actions · CI/CD compliance · LLM-assisted auditing · workflow security · multi-tier adjudication · best practices checklist · Java workflows

The pith

A 30-criteria checklist audit of 95 Java GitHub Actions workflows finds 28% overall compliance; LLMs make the checks scalable through a hybrid adjudication framework that cuts verification effort by 81%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a documentation-based checklist of 30 criteria across four workflow sections and eight themes to evaluate security, maintainability, and performance in GitHub Actions pipelines. It tests four open-weight LLMs on 2,850 assessments from 95 real-world Java workflows and records only fair inter-model agreement with notable gaps on structural and security items. To make the process practical, the authors add a multi-tier adjudication step that lets GPT-5 reconcile disagreements before limited manual review. The resulting hybrid method cuts verification work by 81% while preserving 87% agreement with expert judgment, yet the measured compliance remains low at 28% overall and 4% for permission controls.

Core claim

We propose a novel, documentation-grounded GHA compliance checklist with 30 criteria spanning four workflow sections and eight themes, and demonstrate on 95 real-world Java workflows that LLMs enable scalable compliance measurement but cannot replace experts; a multi-tier adjudication framework in which GPT-5 resolves model conflicts before targeted manual review reduces verification effort by 81% while retaining 87% agreement with expert judgment, revealing major compliance gaps including 28% overall and 4% for permission controls.

What carries the argument

The multi-tier adjudication framework that routes LLM outputs through GPT-5 conflict resolution followed by selective human review against the 30-criteria checklist.
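As a concrete reading of that routing, here is a minimal sketch of the three tiers, assuming per-criterion pass/fail verdicts from the four open-weight models; the near-unanimity threshold, the adjudicator interface, and the escalation rule are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter
from typing import Callable, Optional

def adjudicate(
    verdicts: dict[str, bool],                     # model name -> compliant?
    adjudicator: Callable[[dict[str, bool]], Optional[bool]],
    near_unanimous: int = 3,                       # assumed "3 of 4 agree" cut-off
) -> tuple[str, Optional[bool]]:
    """Hypothetical tiers: accept (near-)unanimous verdicts, escalate splits to a
    stronger adjudicator model, and fall back to manual review last."""
    counts = Counter(verdicts.values())
    label, votes = counts.most_common(1)[0]
    if votes >= near_unanimous:                    # Tier 1: accept the majority verdict
        return "auto", label
    resolved = adjudicator(verdicts)               # Tier 2: stronger-model adjudication
    if resolved is not None:
        return "adjudicated", resolved
    return "manual", None                          # Tier 3: route to a human reviewer

# Example: a 2-2 split escalates; a mock adjudicator breaks the tie.
votes = {"model_a": True, "model_b": True, "model_c": False, "model_d": False}
tier, verdict = adjudicate(votes, adjudicator=lambda v: False)
print(tier, verdict)   # adjudicated False
```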

If this is right

  • Overall compliance stands at 28%, with permission controls at 4% and security themes at 26% while clarity reaches 68%.
  • LLMs display only fair agreement (Fleiss' kappa = 0.28) and systematic disagreements on structural reasoning and security-sensitive judgments (a worked sketch of the statistic follows this list).
  • Hybrid human-AI auditing supplies a practical route to defensible large-scale compliance measurement.
  • Security and maintainability practices lag substantially behind clarity practices in real workflows.
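For readers unfamiliar with the agreement statistic behind the 0.28 figure, this is the standard Fleiss' kappa computation over an items-by-categories count matrix; the toy counts below are invented and are not the paper's data.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an items x categories count matrix.

    ratings[i, j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters n.
    """
    n = ratings.sum(axis=1)[0]                    # raters per item
    p_j = ratings.sum(axis=0) / ratings.sum()     # overall category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy data: 6 criteria judged compliant/non-compliant by 4 models (invented counts).
counts = np.array([[4, 0], [4, 0], [0, 4], [3, 1], [1, 3], [2, 2]])
print(round(fleiss_kappa(counts), 2))
```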

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding the checklist and framework directly into workflow editors could flag non-compliant patterns at creation time (a minimal lint-style sketch follows this list).
  • The same checklist-driven hybrid method could extend to auditing other CI/CD platforms or infrastructure-as-code repositories.
  • Widespread low compliance points to a need for clearer official guidance or built-in tooling that enforces the documented practices.
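A minimal lint-style sketch of the first extension, assuming PyYAML is available; the two rules shown (an explicit top-level permissions block, actions pinned to a commit SHA) are plausible checklist-style checks, not criteria reproduced from the paper.

```python
import re
import yaml  # PyYAML; in an editor integration the text would come from the buffer

def audit_workflow(text: str) -> list[str]:
    """Two illustrative checks in the spirit of the checklist; stand-ins only."""
    doc = yaml.safe_load(text) or {}
    findings = []
    if "permissions" not in doc:
        findings.append("no top-level `permissions:` block (least privilege not declared)")
    for job in (doc.get("jobs") or {}).values():
        for step in job.get("steps", []):
            uses = step.get("uses", "")
            # pinning by tag or branch instead of a full commit SHA is a common smell
            if uses and not re.search(r"@[0-9a-f]{40}$", uses):
                findings.append(f"action not pinned to a commit SHA: {uses}")
    return findings

example = """
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: mvn -B verify
"""
for f in audit_workflow(example):
    print("warn:", f)
```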

Load-bearing premise

The 30-criteria checklist accurately and comprehensively captures documented best practices for security, maintainability, and performance in GitHub Actions workflows.

What would settle it

A complete independent manual audit of the same 95 workflows using the identical 30-criteria checklist, then direct comparison of the resulting compliance percentages and adjudication effort savings against the reported 28% and 81% figures.
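A sketch of how such a comparison could be scored, assuming "agreement" means simple percent agreement on per-criterion verdicts and "effort reduction" means the share of assessments that never reach a human; both metric definitions are assumptions for illustration, and the toy arrays are invented.

```python
def compare(expert: list[bool], hybrid: list[bool], manually_reviewed: list[bool]) -> dict[str, float]:
    """Score a re-audit against the hybrid pipeline's verdicts (illustrative metrics)."""
    n = len(expert)
    agreement = sum(e == h for e, h in zip(expert, hybrid)) / n
    effort_saved = 1 - sum(manually_reviewed) / n    # items that never reached a human
    return {"agreement": agreement, "effort_saved": effort_saved}

# Toy numbers only; the paper reports 87% agreement and 81% effort reduction.
expert = [True, False, False, True, False, True, False, False, True, False]
hybrid = [True, False, False, True, False, True, True,  False, True, False]
manual = [False, False, True, False, False, False, True, False, False, False]
print(compare(expert, hybrid, manual))   # {'agreement': 0.9, 'effort_saved': 0.8}
```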

Figures

Figures reproduced from arXiv: 2605.02091 by Edward Abrokwah, Taher A. Ghaleb.

Figure 1
Figure 1. Full study pipeline. Left: checklist derivation from GHA documentation (top) and dataset construction from 8,924 Java projects, filtered and sampled to 2,850 checklist evaluations (bottom). Right: LLM-based compliance auditing (unanimous or near-unanimous model verdicts are accepted; disagreements escalate to GPT-5 and, if still unresolved, to manual review).
Original abstract

GitHub Actions (GHA) CI workflows are critical infrastructure, but current tooling offers only syntactic or heuristic checks and does not enforce documented best practices for security, maintainability, or performance. Consequently, issues like over-privileged permissions, weak secrets management, and missing failure notifications remain undetected in real-world pipelines. This paper proposes a novel, documentation-grounded GHA compliance checklist with 30 criteria spanning four workflow sections and eight themes, and assesses Large Language Models (LLMs) for scalable compliance auditing. On 95 real-world Java workflows (2,850 assessments) using four open-weight LLMs, we find only fair agreement (Fleiss' kappa = 0.28), with systematic disagreement on structural reasoning and security-sensitive judgments. To address this, we introduce a multi-tier adjudication framework in which GPT 5 resolves model conflicts before targeted manual review, reducing verification effort by 81% while retaining 87% agreement with expert judgment. At scale, it reveals major compliance gaps: overall compliance is 28%, dropping to 4% for permission controls; Security (26%) lags far behind Clarity (68%). Our results show that LLMs enable scalable compliance measurement but cannot replace experts, highlighting the need for hybrid human-AI auditing and providing empirical benchmarks and guidance for defensible GHA workflow audits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a documentation-grounded 30-criteria checklist for GitHub Actions (GHA) workflows covering security, maintainability, and performance across four sections and eight themes. It evaluates four open-weight LLMs via 2,850 assessments on 95 real-world Java workflows, reports only fair inter-LLM agreement (Fleiss' kappa = 0.28) with systematic disagreements on structural and security judgments, and proposes a multi-tier adjudication framework (GPT-5 resolving conflicts before targeted manual review) that reduces verification effort by 81% while retaining 87% agreement with expert judgment. The study finds low compliance (28% overall, 4% for permissions, 26% for Security vs. 68% for Clarity) and concludes that LLMs enable scalable auditing but require hybrid human oversight.

Significance. If the checklist and sampling are robust, this provides concrete empirical benchmarks on real-world GHA compliance gaps and a practical, quantified hybrid auditing method that could inform both practitioners and future tooling. The work is strengthened by its use of a substantial assessment volume (2,850), reporting of Fleiss' kappa, and explicit effort-reduction and agreement metrics for the adjudication framework.

major comments (2)
  1. [§3 (Checklist Construction) and Abstract] The headline compliance rates (28% overall, 4% permissions, Security 26% vs. Clarity 68%) and the LLM-vs-expert comparison both presuppose that the 30-criteria checklist is a faithful, complete proxy for documented best practices. The manuscript describes it as 'documentation-grounded' but provides no explicit derivation process, source-to-criterion mapping, coverage audit against GitHub docs/advisories, or external validation (e.g., expert consensus). This is load-bearing for the 'major gaps' claim and risks systematic omissions or debatable thresholds inflating the reported non-compliance. (An illustrative shape for such a mapping is sketched after this list.)
  2. [§4 (Workflow Sampling and Dataset)] The selection criteria and representativeness of the 95 Java workflows are not fully specified (e.g., filtering rules, repository popularity thresholds, or exclusion of toy/example repos). This directly affects the generalizability of the compliance statistics and the claim that the findings reveal 'major compliance gaps' in real-world pipelines.
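To make the first request concrete, one plausible shape for the missing source-to-criterion mapping is sketched below; the criterion ID, theme, wording, and documentation URL are placeholders, not entries from the paper's checklist.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    cid: str          # e.g. "SEC-01" (hypothetical identifier scheme)
    theme: str        # one of the eight themes
    section: str      # one of the four workflow sections
    text: str         # the checklist wording shown to the LLMs
    source: str       # documentation page the criterion was derived from

CHECKLIST = [
    Criterion(
        cid="SEC-01",
        theme="Security",
        section="jobs",
        text="Workflow declares least-privilege `permissions` explicitly.",
        source="https://docs.github.com/en/actions",  # illustrative source link
    ),
    # ... remaining criteria would be enumerated the same way
]
```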
minor comments (3)
  1. [Results tables] Table 2 or equivalent (per-criterion results): Consider adding a column or footnote showing the exact wording of each criterion alongside the compliance percentage to improve traceability.
  2. [Figure 1 (adjudication framework)] The diagram and caption could more explicitly label the three tiers and the exact conditions for escalating to manual review.
  3. [§5 (Threats to Validity)] The discussion of LLM prompt sensitivity and criterion interpretation variability could be expanded with a brief sensitivity analysis or example prompts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with specific plans for revision where the concerns are valid, while defending the core methodology on substantive grounds.

Point-by-point responses
  1. Referee: [§3 (Checklist Construction) and Abstract] The headline compliance rates (28% overall, 4% permissions, Security 26% vs. Clarity 68%) and the LLM-vs-expert comparison both presuppose that the 30-criteria checklist is a faithful, complete proxy for documented best practices. The manuscript describes it as 'documentation-grounded' but provides no explicit derivation process, source-to-criterion mapping, coverage audit against GitHub docs/advisories, or external validation (e.g., expert consensus). This is load-bearing for the 'major gaps' claim and risks systematic omissions or debatable thresholds inflating the reported non-compliance.

    Authors: We agree that greater transparency in the checklist derivation would strengthen the paper. The 30 criteria were systematically extracted from GitHub's official documentation on workflow syntax, security best practices, and advisories (e.g., permissions, secrets, and job configuration sections), cross-referenced against community guidelines and prior literature on CI security. However, the original submission did not include an explicit mapping table or coverage audit. In the revised manuscript we will add a dedicated subsection in §3 with a complete source-to-criterion mapping, references to specific GitHub documentation URLs, and an explicit discussion of coverage and any threshold decisions. We will also note the absence of a formal expert consensus round as a limitation. These additions directly address the load-bearing concern without changing the reported compliance figures. revision: yes

  2. Referee: [§4 (Workflow Sampling and Dataset)] The selection criteria and representativeness of the 95 Java workflows are not fully specified (e.g., filtering rules, repository popularity thresholds, or exclusion of toy/example repos). This directly affects the generalizability of the compliance statistics and the claim that the findings reveal 'major compliance gaps' in real-world pipelines.

    Authors: We acknowledge that the sampling description in §4 could be more explicit to support reproducibility and generalizability claims. The 95 workflows were drawn from actively maintained open-source Java repositories on GitHub that contain at least one .github/workflows file, with selection prioritizing repositories having non-trivial activity (measured by recent commits and stars) while excluding forks, archived repositories, and purely tutorial/example projects. To improve clarity, the revised §4 will enumerate the precise filtering rules, any popularity thresholds applied, exclusion criteria, and the rationale for restricting to Java projects (to control for language-specific workflow patterns). This expansion will better substantiate the claim of revealing gaps in real-world pipelines. revision: yes
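A sketch of a repository filter matching the rules enumerated in this response, assuming repository metadata has already been collected; the field names and numeric thresholds are assumptions, since the exact values are not given here.

```python
from datetime import datetime, timedelta, timezone

def keep_repository(repo: dict) -> bool:
    """Hypothetical sampling filter mirroring the described rules."""
    recent = datetime.now(timezone.utc) - timedelta(days=365)   # assumed activity window
    return (
        not repo["fork"]
        and not repo["archived"]
        and repo["workflow_count"] >= 1                  # has .github/workflows files
        and repo["stars"] >= 10                          # assumed popularity floor
        and repo["last_commit"] >= recent
        and not repo["is_tutorial_or_example"]           # flagged upstream, e.g. by name heuristics
    )

candidate = {
    "fork": False, "archived": False, "workflow_count": 2, "stars": 120,
    "last_commit": datetime.now(timezone.utc) - timedelta(days=30),
    "is_tutorial_or_example": False,
}
print(keep_repository(candidate))   # True
```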

Circularity Check

0 steps flagged

No circularity: direct empirical measurement on external workflows

Full rationale

The paper performs an empirical audit: it defines a 30-criteria checklist, applies LLMs to 95 real-world Java workflows (2,850 assessments), measures agreement via Fleiss' kappa, and evaluates a multi-tier adjudication framework against expert judgment. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the abstract or described methodology. Compliance rates (28% overall, 4% permissions) and effort-reduction figures (81%) are direct counts and comparisons on external data, not reductions by construction. The checklist is presented as documentation-grounded without any claim that its validity is proven by the compliance results themselves. This is a standard measurement study whose central claims rest on external benchmarks rather than self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the proposed checklist faithfully represents best practices and that the hybrid adjudication process produces judgments sufficiently close to expert review for practical use.

axioms (1)
  • domain assumption: The 30 criteria accurately reflect documented best practices for GitHub Actions workflows in security, maintainability, and performance.
    The checklist is presented as documentation-grounded, but its validity as a ground truth for compliance is presupposed rather than independently validated in the abstract.

pith-pipeline@v0.9.0 · 5540 in / 1366 out tokens · 42303 ms · 2026-05-08T19:14:00.995150+00:00 · methodology

