pith. machine review for the scientific record.

arxiv: 2604.21090 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.AI

Recognition: unknown

Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:24 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI governance · governance prompts · structural completeness · AGENTS.md · evaluation framework · requirements engineering · empirical study · prompt quality

The pith

A five-principle framework finds that 37 percent of evaluated file-model pairs built from practitioner AI governance prompts fall below a structural completeness threshold, most often for missing data classification rules and assessment rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a five-principle evaluation framework to determine whether AI governance prompts, which serve as executable specifications for agent behavior, contain all necessary structural components. It applies the framework to 34 publicly available AGENTS.md files and reports that 37 percent of file-model pairs fall below the completeness threshold, most often because they omit data classification rules and assessment criteria. A sympathetic reader would care because incomplete prompts can allow AI agents to operate without clear mandates or quality checks, raising risks in deployed systems. The findings indicate that these gaps follow consistent patterns that could be caught and repaired through automated static analysis.
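
The detect-and-repair claim is concrete enough to sketch. Below is a minimal illustration of such a static check in Python; the principle names and keyword cues are hypothetical stand-ins, since the abstract does not give the paper's actual scoring rules.

    import re

    # Hypothetical cue patterns for five illustrative principles; the paper's
    # operational definitions are not stated in the abstract and may differ.
    PRINCIPLE_CUES = {
        "success_definition": r"task is complete when|done when|completion criteri",
        "scope": r"\bscope\b|do not modify|only (edit|touch)",
        "data_classification": r"confidential|public|restricted|data classification",
        "assessment_rubric": r"rubric|quality criteri|acceptance criteri",
        "escalation": r"escalate|stop and ask|ask a human",
    }

    def score_prompt(text: str) -> dict[str, bool]:
        """Mark each principle present if any of its cue patterns matches."""
        return {name: bool(re.search(pattern, text, re.IGNORECASE))
                for name, pattern in PRINCIPLE_CUES.items()}

    def is_complete(scores: dict[str, bool], threshold: int = 4) -> bool:
        """Pass a prompt if at least `threshold` of the five principles are present."""
        return sum(scores.values()) >= threshold

Run over a corpus, the per-principle absences give the missing-element ranking, and the share of prompts below threshold gives a headline rate like the paper's 37 percent.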

Core claim

The central claim is that 37% of evaluated file-model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner-authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate, with implications for requirements engineering practice in AI-assisted development contexts, and they expose a previously undocumented artefact classification gap in the AGENTS.md convention.

What carries the argument

Five-principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology that measures structural completeness of governance prompts.

Load-bearing premise

The five-principle evaluation framework correctly measures structural completeness of governance prompts.

What would settle it

Re-evaluate the same 34 AGENTS.md files with an alternative set of completeness criteria and check whether the 37 percent failure rate and the same two most-missing elements persist.
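
A minimal harness for that robustness check, assuming each criteria set is packaged as a scoring function shaped like score_prompt in the earlier sketch (corpus and alternative_scorer are placeholders):

    from pathlib import Path

    def failure_rate(files: list[Path], scorer, threshold: int = 4):
        """Return the below-threshold share and the criteria ranked by absence count."""
        absences: dict[str, int] = {}
        failures = 0
        for path in files:
            scores = scorer(path.read_text())
            failures += sum(scores.values()) < threshold
            for name, present in scores.items():
                absences[name] = absences.get(name, 0) + (not present)
        ranked = sorted(absences, key=absences.get, reverse=True)
        return failures / len(files), ranked

    # The finding persists if both instruments agree on the headline rate and
    # on the two most-missing elements:
    # rate_a, missing_a = failure_rate(corpus, score_prompt)
    # rate_b, missing_b = failure_rate(corpus, alternative_scorer)
    # robust = abs(rate_a - rate_b) < 0.05 and set(missing_a[:2]) == set(missing_b[:2])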

Original abstract

AI governance programmes increasingly rely on natural language prompts to constrain and direct AI agent behaviour. These prompts function as executable specifications: they define the agent's mandate, scope, and quality criteria. Despite this role, no systematic framework exists for evaluating whether a governance prompt is structurally complete. We introduce a five-principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology, and apply it to an empirical corpus of 34 publicly available AGENTS.md governance files sourced from GitHub. Our evaluation reveals that 37% of evaluated file-model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner-authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate. We discuss implications for requirements engineering practice in AI-assisted development contexts, identify a previously undocumented artefact classification gap in the AGENTS.md convention, and propose directions for tool support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a five-principle evaluation framework for AI governance prompts, grounded in computability theory, proof theory, and Bayesian epistemology. It applies the framework to an empirical corpus of 34 publicly available AGENTS.md files from GitHub, reporting that 37% of file-model pairs fall below a structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. The authors interpret these patterns as evidence for consistent structural gaps amenable to automated static analysis and discuss implications for requirements engineering in AI-assisted development.

Significance. If the descriptive findings hold under a reproducible application of the framework, the work supplies an initial empirical baseline on the structural quality of practitioner AI governance artifacts and flags a potential undocumented gap in the AGENTS.md convention. This could usefully inform tool-building efforts in prompt engineering and requirements engineering for AI agents. The contribution remains primarily observational rather than predictive or causal, limiting broader impact until the framework receives external validation.

major comments (2)
  1. [§3 (Methodology)] The manuscript provides no details on the selection criteria or sampling procedure for the 34 GitHub AGENTS.md files, the precise scoring rules and threshold definition for the five principles, the operationalization of each principle, or any inter-rater reliability statistics. These omissions directly affect the reproducibility and interpretability of the central 37% statistic and the reported per-criterion absence rates.
  2. [§4 (Results)] The identification of data classification and assessment rubric as the most frequently absent criteria rests on application of the new framework, yet the paper supplies neither example prompt excerpts with their scores nor a pilot validation showing that the framework aligns with the claimed theoretical foundations. This leaves open whether the observed gaps are properties of the corpus or artifacts of an untested instrument.
minor comments (2)
  1. [Abstract] The term 'file-model pairs' is introduced without definition; a single clarifying sentence would aid readers unfamiliar with the evaluation design.
  2. [Introduction] A brief explanation of the AGENTS.md convention and its role in AI agent governance would provide necessary context for the broader software-engineering audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional detail will improve the manuscript's reproducibility and interpretability. We address each major comment below and indicate the revisions planned for the next version.

Point-by-point responses
  1. Referee: [§3 (Methodology)] The manuscript provides no details on the selection criteria or sampling procedure for the 34 GitHub AGENTS.md files, the precise scoring rules and threshold definition for the five principles, the operationalization of each principle, or any inter-rater reliability statistics. These omissions directly affect the reproducibility and interpretability of the central 37% statistic and the reported per-criterion absence rates.

    Authors: We agree that the current description of the methodology is insufficient for full reproducibility. In the revised manuscript we will expand §3 to specify the GitHub search strategy and inclusion criteria used to arrive at the corpus of 34 AGENTS.md files, the exact operational definitions and scoring rules for each of the five principles, the numerical threshold applied to determine structural completeness, and the evaluation procedure (including any consistency checks performed). These additions will allow independent replication of the 37% figure and the per-criterion absence rates. revision: yes

  2. Referee: [§4 (Results)] The identification of data classification and assessment rubric as the most frequently absent criteria rests on application of the new framework, yet the paper supplies neither example prompt excerpts with their scores nor a pilot validation showing that the framework aligns with the claimed theoretical foundations. This leaves open whether the observed gaps are properties of the corpus or artifacts of an untested instrument.

    Authors: We accept that concrete illustrations are needed. The revised version will include selected excerpts from AGENTS.md files together with the scores assigned under each principle. The framework was constructed from principles drawn from computability theory, proof theory, and Bayesian epistemology, but no separate pilot validation with external raters was conducted. We will add an explicit discussion of the theoretical derivation and will note the lack of empirical validation as a limitation. The observed absences are consistent across the corpus; the added examples will allow readers to assess whether the gaps are corpus properties or instrument artifacts. revision: partial
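
The pilot validation conceded here would typically report inter-rater agreement. A minimal sketch of one standard statistic, Cohen's kappa, computed over two raters' binary present/absent judgments per principle; this is illustrative, not the paper's procedure:

    def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
        """Cohen's kappa for two raters' binary judgments (assumes chance agreement < 1)."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        p_a, p_b = sum(rater_a) / n, sum(rater_b) / n
        expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
        return (observed - expected) / (1 - expected)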

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper introduces a new five-principle evaluation framework with explicitly defined scoring rules and applies it directly to a fixed corpus of 34 GitHub-sourced AGENTS.md files. The central empirical result (37% of file-model pairs below threshold) is produced by straightforward counting of per-criterion failures under those rules; no equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the outcome. The framework's grounding in computability theory, proof theory, and Bayesian epistemology is stated as motivation rather than a load-bearing derivation step that reduces the measurements to prior inputs. The claim remains a descriptive observation of the chosen corpus under the authors' instrument and does not collapse into self-definition or statistical forcing.
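
To make the counting concrete: the abstract never states how many models each file was paired with, so the arithmetic below is purely illustrative of how a "file-model pair" rate is tallied, not the paper's numbers.

    # Hypothetical: 34 files each evaluated under 3 models (model count not stated).
    files, models = 34, 3
    pairs = files * models                    # 102 file-model pairs
    below_threshold = 38                      # any count near 0.37 * pairs
    print(f"{below_threshold / pairs:.0%}")   # -> 37%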

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework's five principles are presented as grounded in established theories but their specific mapping to prompt evaluation is introduced without prior derivation or validation in the abstract; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: A five-principle framework grounded in computability theory, proof theory, and Bayesian epistemology can evaluate structural completeness of governance prompts
    Stated directly in the abstract as the basis for the new framework.

pith-pipeline@v0.9.0 · 5459 in / 1145 out tokens · 23176 ms · 2026-05-09T23:24:03.529152+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Rice, H.G. (1953). Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74(2), 358-366. DOI: 10.1090/S0002-9947-1953-0053041-6

  2. [2]

    Zietsman, C. (2026). The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review. arXiv:2603.25773. Available at: https://doi.org/10.48550/arXiv.2603.25773

  3. [3]

    Jin, W.-L. (2025). FASTRIC: Prompt Specification Language for Verifiable LLM Interactions. arXiv:2512.18940. Available at: https://arxiv.org/abs/2512.18940

  4. [4]

    Gartner. (2025). Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027. Gartner Newsroom, June 25, 2025. Analyst: Anushree Verma. Available at: https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

  5. [5]

    IBM Security. (2025). Cost of a Data Breach Report 2025. IBM Corporation.

  6. [6]

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. Available at: https://arxiv.org/abs/2212.08073

  7. [7]

    IEEE. (1998). IEEE Recommended Practice for Software Requirements Specifications (IEEE Std 830-1998). IEEE.

  8. [8]

    Gnesi, S. and Trentanni, G. (2019). QuARS: A NLP Tool for Requirements Analysis. NLP4RE Workshop, CEUR-WS Vol-2376.

  9. [9]

    Available at: https://ceur-ws.org/Vol-2376/NLP4RE19_paper07.pdf

  10. [10]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35. arXiv:2201.11903

  11. [11]

    Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.

  12. [12]

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744. arXiv:2203.02155

  13. [13]

    Perez, F. and Ribeiro, I. (2022). Ignore previous prompt: attack techniques for language models. NeurIPS 2022 ML Safety Workshop. arXiv:2211.09527

  14. [14]

    Howard, W.A. (1980). The formulae-as-types notion of construction. In Seldin, J.P. and Hindley, J.R. (eds.), To H.B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, pp. 479-490. Academic Press, New York.

  15. [15]

    Mavin, A., Wilkinson, P., Harwood, A. and Novak, M. (2009). Easy approach to requirements syntax (EARS). In Proceedings of the 17th IEEE International Requirements Engineering Conference (RE ’09), pp. 317-322. IEEE. DOI: 10.1109/RE.2009.9

  16. [16]

    Williamson, J. (2010). In Defence of Objective Bayesianism. Oxford University Press, Oxford. DOI: 10.1093/acprof:oso/9780199228003.001.0001

  17. [17]

    AGENTS.md. (2025). The open standard for AI agent instructions. Available at: https://agents.md/

  18. [18]

    Linux Foundation. (2025). Linux Foundation Announces the Formation of the Agentic AI Foundation. 9 December 2025. Available at: https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation

  19. [19]

    Zietsman, C. (2026). governance-prompts-v1 corpus. Available at: https://github.com/czietsman/nuphirho.dev/tree/dcb7036/experiments/governance-prompts-v1