pith. machine review for the scientific record.

arxiv: 2604.21090 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.AI

Recognition: unknown

Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:24 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI governance · governance prompts · structural completeness · AGENTS.md · evaluation framework · requirements engineering · empirical study · prompt quality

The pith

A five-principle framework finds that 37 percent of evaluated file-model pairs built from practitioner AI governance prompts fall below a structural completeness threshold, most often for missing data classification rules and assessment rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a five-principle evaluation framework to determine whether AI governance prompts, which serve as executable specifications for agent behavior, contain all necessary structural components. It applies the framework to 34 publicly available AGENTS.md files and reports that 37 percent of file-model pairs fall below the completeness threshold, most often because they omit data classification rules and assessment criteria. A sympathetic reader would care because incomplete prompts can allow AI agents to operate without clear mandates or quality checks, raising risks in deployed systems. The findings indicate that these gaps follow consistent patterns that could be caught and repaired through automated static analysis.
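
The detect-and-repair claim is concrete enough to sketch. Below is a minimal illustration of such a static check in Python; the principle names and keyword cues are hypothetical stand-ins, since the abstract does not give the paper's actual scoring rules.

    import re

    # Hypothetical cue patterns for five illustrative principles; the paper's
    # operational definitions are not stated in the abstract and may differ.
    PRINCIPLE_CUES = {
        "success_definition": r"task is complete when|done when|completion criteri",
        "scope": r"\bscope\b|do not modify|only (edit|touch)",
        "data_classification": r"confidential|public|restricted|data classification",
        "assessment_rubric": r"rubric|quality criteri|acceptance criteri",
        "escalation": r"escalate|stop and ask|ask a human",
    }

    def score_prompt(text: str) -> dict[str, bool]:
        """Mark each principle present if any of its cue patterns matches."""
        return {name: bool(re.search(pattern, text, re.IGNORECASE))
                for name, pattern in PRINCIPLE_CUES.items()}

    def is_complete(scores: dict[str, bool], threshold: int = 4) -> bool:
        """Pass a prompt if at least `threshold` of the five principles are present."""
        return sum(scores.values()) >= threshold

Run over a corpus, the per-principle absences give the missing-element ranking, and the share of prompts below threshold gives a headline rate like the paper's 37 percent.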

Core claim

The central claim is that 37% of evaluated file-model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner-authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate, with implications for requirements engineering practice in AI-assisted development contexts, and they expose a previously undocumented artefact classification gap in the AGENTS.md convention.

What carries the argument

Five-principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology that measures structural completeness of governance prompts.

Load-bearing premise

The five-principle evaluation framework correctly measures structural completeness of governance prompts.

What would settle it

Re-evaluate the same 34 AGENTS.md files with an alternative set of completeness criteria and check whether the 37 percent failure rate and the same two most-missing elements persist.
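
A minimal harness for that robustness check, assuming each criteria set is packaged as a scoring function shaped like score_prompt in the earlier sketch (corpus and alternative_scorer are placeholders):

    from pathlib import Path

    def failure_rate(files: list[Path], scorer, threshold: int = 4):
        """Return the below-threshold share and the criteria ranked by absence count."""
        absences: dict[str, int] = {}
        failures = 0
        for path in files:
            scores = scorer(path.read_text())
            failures += sum(scores.values()) < threshold
            for name, present in scores.items():
                absences[name] = absences.get(name, 0) + (not present)
        ranked = sorted(absences, key=absences.get, reverse=True)
        return failures / len(files), ranked

    # The finding persists if both instruments agree on the headline rate and
    # on the two most-missing elements:
    # rate_a, missing_a = failure_rate(corpus, score_prompt)
    # rate_b, missing_b = failure_rate(corpus, alternative_scorer)
    # robust = abs(rate_a - rate_b) < 0.05 and set(missing_a[:2]) == set(missing_b[:2])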

Original abstract

AI governance programmes increasingly rely on natural language prompts to constrain and direct AI agent behaviour. These prompts function as executable specifications: they define the agent's mandate, scope, and quality criteria. Despite this role, no systematic framework exists for evaluating whether a governance prompt is structurally complete. We introduce a five-principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology, and apply it to an empirical corpus of 34 publicly available AGENTS.md governance files sourced from GitHub. Our evaluation reveals that 37% of evaluated file-model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner-authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate. We discuss implications for requirements engineering practice in AI-assisted development contexts, identify a previously undocumented artefact classification gap in the AGENTS.md convention, and propose directions for tool support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a five-principle evaluation framework for AI governance prompts, grounded in computability theory, proof theory, and Bayesian epistemology. It applies the framework to an empirical corpus of 34 publicly available AGENTS.md files from GitHub, reporting that 37% of file-model pairs fall below a structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. The authors interpret these patterns as evidence for consistent structural gaps amenable to automated static analysis and discuss implications for requirements engineering in AI-assisted development.

Significance. If the descriptive findings hold under a reproducible application of the framework, the work supplies an initial empirical baseline on the structural quality of practitioner AI governance artifacts and flags a potential undocumented gap in the AGENTS.md convention. This could usefully inform tool-building efforts in prompt engineering and requirements engineering for AI agents. The contribution remains primarily observational rather than predictive or causal, limiting broader impact until the framework receives external validation.

major comments (2)
  1. [§3 (Methodology)] The manuscript provides no details on the selection criteria or sampling procedure for the 34 GitHub AGENTS.md files, the precise scoring rules and threshold definition for the five principles, the operationalization of each principle, or any inter-rater reliability statistics. These omissions directly affect the reproducibility and interpretability of the central 37% statistic and the reported per-criterion absence rates.
  2. [§4 (Results)] The identification of data classification and assessment rubric as the most frequently absent criteria rests on application of the new framework, yet the paper supplies neither example prompt excerpts with their scores nor a pilot validation showing that the framework aligns with the claimed theoretical foundations. This leaves open whether the observed gaps are properties of the corpus or artifacts of an untested instrument.
minor comments (2)
  1. [Abstract] The term 'file-model pairs' is introduced without definition; a single clarifying sentence would aid readers unfamiliar with the evaluation design.
  2. [Introduction] A brief explanation of the AGENTS.md convention and its role in AI agent governance would provide necessary context for the broader software-engineering audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional detail will improve the manuscript's reproducibility and interpretability. We address each major comment below and indicate the revisions planned for the next version.

Point-by-point responses
  1. Referee: [§3 (Methodology)] The manuscript provides no details on the selection criteria or sampling procedure for the 34 GitHub AGENTS.md files, the precise scoring rules and threshold definition for the five principles, the operationalization of each principle, or any inter-rater reliability statistics. These omissions directly affect the reproducibility and interpretability of the central 37% statistic and the reported per-criterion absence rates.

    Authors: We agree that the current description of the methodology is insufficient for full reproducibility. In the revised manuscript we will expand §3 to specify the GitHub search strategy and inclusion criteria used to arrive at the corpus of 34 AGENTS.md files, the exact operational definitions and scoring rules for each of the five principles, the numerical threshold applied to determine structural completeness, and the evaluation procedure (including any consistency checks performed). These additions will allow independent replication of the 37% figure and the per-criterion absence rates. revision: yes

  2. Referee: [§4 (Results)] The identification of data classification and assessment rubric as the most frequently absent criteria rests on application of the new framework, yet the paper supplies neither example prompt excerpts with their scores nor a pilot validation showing that the framework aligns with the claimed theoretical foundations. This leaves open whether the observed gaps are properties of the corpus or artifacts of an untested instrument.

    Authors: We accept that concrete illustrations are needed. The revised version will include selected excerpts from AGENTS.md files together with the scores assigned under each principle. The framework was constructed from principles drawn from computability theory, proof theory, and Bayesian epistemology, but no separate pilot validation with external raters was conducted. We will add an explicit discussion of the theoretical derivation and will note the lack of empirical validation as a limitation. The observed absences are consistent across the corpus; the added examples will allow readers to assess whether the gaps are corpus properties or instrument artifacts. revision: partial
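
The pilot validation conceded here would typically report inter-rater agreement. A minimal sketch of one standard statistic, Cohen's kappa, computed over two raters' binary present/absent judgments per principle; this is illustrative, not the paper's procedure:

    def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
        """Cohen's kappa for two raters' binary judgments (assumes chance agreement < 1)."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        p_a, p_b = sum(rater_a) / n, sum(rater_b) / n
        expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
        return (observed - expected) / (1 - expected)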

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper introduces a new five-principle evaluation framework with explicitly defined scoring rules and applies it directly to a fixed corpus of 34 GitHub-sourced AGENTS.md files. The central empirical result (37% of file-model pairs below threshold) is produced by straightforward counting of per-criterion failures under those rules; no equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the outcome. The framework's grounding in computability theory, proof theory, and Bayesian epistemology is stated as motivation rather than a load-bearing derivation step that reduces the measurements to prior inputs. The claim remains a descriptive observation of the chosen corpus under the authors' instrument and does not collapse into self-definition or statistical forcing.
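
To make the counting concrete: the abstract never states how many models each file was paired with, so the arithmetic below is purely illustrative of how a "file-model pair" rate is tallied, not the paper's numbers.

    # Hypothetical: 34 files each evaluated under 3 models (model count not stated).
    files, models = 34, 3
    pairs = files * models                    # 102 file-model pairs
    below_threshold = 38                      # any count near 0.37 * pairs
    print(f"{below_threshold / pairs:.0%}")   # -> 37%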

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework's five principles are presented as grounded in established theories but their specific mapping to prompt evaluation is introduced without prior derivation or validation in the abstract; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: A five-principle framework grounded in computability theory, proof theory, and Bayesian epistemology can evaluate structural completeness of governance prompts
    Stated directly in the abstract as the basis for the new framework.

pith-pipeline@v0.9.0 · 5459 in / 1145 out tokens · 23176 ms · 2026-05-09T23:24:03.529152+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Rice, H.G. (1953). Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74(2), 358-366. DOI: 10.1090/S0002-9947-1953-0053041-6

  2. [2]

    Zietsman, C. (2026). The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review. arXiv:2603.25773. Available at: https://doi.org/10.48550/arXiv.2603.25773

  3. [3]

    Jin, W.-L. (2025). FASTRIC: Prompt Specification Language for Verifiable LLM Interactions. arXiv:2512.18940. Available at: https://arxiv.org/abs/2512.18940

  4. [4]

    Gartner. (2025). Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027. Gartner Newsroom, June 25, 2025. Analyst: Anushree Verma. Available at: https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

  5. [5]

    IBM Security. (2025). Cost of a Data Breach Report 2025. IBM Corporation.

  6. [6]

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. Available at: https://arxiv.org/abs/2212.08073

  7. [7]

    IEEE. (1998). IEEE Recommended Practice for Software Requirements Specifications (IEEE Std 830-1998). IEEE.

  8. [8]

    Gnesi, S. and Trentanni, G. (2019). QuARS: A NLP Tool for Requirements Analysis. NLP4RE Workshop, CEUR-WS Vol-2376.

  9. [9]

    Available at: https://ceur-ws.org/Vol-2376/NLP4RE19_paper07.pdf

  10. [10]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35. arXiv:2201.11903

  11. [11]

    Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.

  12. [12]

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744. arXiv:2203.02155

  13. [13]

    Perez, F. and Ribeiro, I. (2022). Ignore previous prompt: attack techniques for language models. NeurIPS 2022 ML Safety Workshop. arXiv:2211.09527

  14. [14]

    Howard, W.A. (1980). The formulae-as-types notion of construction. In Seldin, J.P. and Hindley, J.R. (eds.), To H.B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, pp. 479-490. Academic Press, New York.

  15. [15]

    Mavin, A., Wilkinson, P., Harwood, A. and Novak, M. (2009). Easy approach to requirements syntax (EARS). In Proceedings of the 17th IEEE International Requirements Engineering Conference (RE ’09), pp. 317-322. IEEE. DOI: 10.1109/RE.2009.9

  16. [16]

    Williamson, J. (2010). In Defence of Objective Bayesianism. Oxford University Press, Oxford. DOI: 10.1093/acprof:oso/9780199228003.001.0001

  17. [17]

    AGENTS.md. (2025). The open standard for AI agent instructions. Available at: https://agents.md/

  18. [18]

    Linux Foundation. (2025). Linux Foundation Announces the Formation of the Agentic AI Foundation. 9 December 2025. Available at: https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation

  19. [19]

    Zietsman, C. (2026). governance-prompts-v1 corpus. Available at: https://github.com/czietsman/nuphirho.dev/tree/dcb7036/experiments/governance-prompts-v1