
arxiv: 2605.24737 · v1 · pith:QDO5JE7Inew · submitted 2026-05-23 · 💻 cs.CL · cs.AI · cs.CY

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

Pith reviewed 2026-06-30 12:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords LLM compliance monitoring · runtime governance · regulatory judges · EU AI Act · model routing · inter-judge disagreement · continuous compliance · governance from metrics

The pith

Regulatory compliance for LLMs emerges as a continuous runtime signal from a panel of specialized judges rather than one-time binary audits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges binary audit-based approaches to AI compliance, arguing they cannot support the continuous human oversight required by regulations such as the EU AI Act. It proposes governance from metrics as an alternative principle that turns compliance into an observable, ongoing signal extracted from runtime data. The govllm framework puts this into practice by routing model choices according to accumulated scores from a panel of regulatory judges, each focused on a specific criterion. Disagreements within this panel are reframed as useful uncertainty signals that prompt human intervention. Testing on a corpus of 49 annotated prompt-response pairs yields agreement rates between 51.5% and 69.1% across four small language models and reveals a position bias that degrades agreement by up to 25 percentage points.

Core claim

The central claim is that governance from metrics derives regulatory compliance as a continuous signal from runtime observability. The govllm framework implements a governance-driven routing architecture in which model selection is based on accumulated compliance scores from a panel of regulatory judges, each specialized per criterion. Inter-judge disagreement is reframed as a regulatory uncertainty signal that warrants human arbitration rather than being treated as noise. Validation on 49 annotated prompt-response pairs across five criteria, using four small on-premise models, shows agreement rates from 51.5% to 69.1%, with no model dominating, and documents structural failure modes plus a position bias.
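
The routing idea in this claim can be illustrated with a minimal sketch. It is not the govllm implementation: the class name `ComplianceRouter`, the `min_observations` cutoff, and the use of a plain mean as the "accumulated" score are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


class ComplianceRouter:
    """Hypothetical router: prefers the model with the best accumulated score."""

    def __init__(self, models, min_observations=3):
        self.models = list(models)
        # Assumed cutoff: models with less history are not eligible for routing.
        self.min_observations = min_observations
        self.scores = defaultdict(list)  # model name -> composite scores in [0, 1]

    def record(self, model, composite_score):
        """Store one composite judge score for a model."""
        self.scores[model].append(composite_score)

    def accumulated(self, model):
        """Mean of recorded scores, or None when history is too short."""
        history = self.scores[model]
        if len(history) < self.min_observations:
            return None
        return mean(history)

    def select(self):
        """Return the eligible model with the highest accumulated score, else None."""
        eligible = [(self.accumulated(m), m) for m in self.models]
        eligible = [(score, m) for score, m in eligible if score is not None]
        if not eligible:
            return None
        return max(eligible)[1]


router = ComplianceRouter(["model-a", "model-b"])
for score in (0.9, 0.8, 0.7):
    router.record("model-a", score)
for score in (0.6, 0.5, 0.7):
    router.record("model-b", score)
print(router.select())  # prints "model-a"
```

Returning `None` when no model has enough history leaves the fallback policy (for example, a default model or a human decision) to the caller, which is a design choice of this sketch rather than something the paper specifies.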

What carries the argument

The argument rests on the panel of regulatory judges: LLM evaluators, each specialized in one criterion (EU AI Act, GDPR, ANSSI, accessibility), whose accumulated compliance scores determine routing and whose disagreements are reframed as actionable uncertainty signals.
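
One way to picture the disagreement signal is the sketch below, which aggregates the scores of several judges for a single response and flags the case for human arbitration when their spread is high. The function name `panel_verdict`, the use of the population standard deviation as the spread measure, and the threshold of 0.2 are illustrative assumptions, not values taken from the paper.

```python
from statistics import mean, pstdev


def panel_verdict(judge_scores, disagreement_threshold=0.2):
    """Aggregate per-judge scores and flag high spread for human arbitration.

    judge_scores: dict mapping a judge name to its score in [0, 1]
    for one response on one criterion.
    disagreement_threshold: assumed cutoff, chosen only for illustration.
    """
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    values = list(judge_scores.values())
    spread = pstdev(values)  # population standard deviation across judges
    return {
        "composite": mean(values),
        "disagreement": spread,
        "needs_human_arbitration": spread > disagreement_threshold,
    }


print(panel_verdict({"judge-a": 0.8, "judge-b": 0.8, "judge-c": 0.8}))
print(panel_verdict({"judge-a": 0.0, "judge-b": 1.0}))
```

In the first call the judges agree and nothing is escalated; in the second the spread is 0.5, so the case is flagged.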

If this is right

  • Model selection in deployed systems is driven by accumulated compliance scores instead of latency or cost.
  • Inter-judge disagreement becomes an explicit signal that triggers human arbitration.
  • Three structural failure modes in small regulatory judges are identified and can be mitigated in the design.
  • Position bias in judges is quantified and can be tested across question-order conditions to improve reliability.
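
The position-bias measurement in the last point can be sketched as follows, assuming each judge produces one label per item under each question ordering and that agreement is the fraction of labels matching the annotation. The function names and the toy labels are hypothetical.

```python
def agreement_rate(judge_labels, ground_truth):
    """Fraction of items where the judge's label matches the annotated label."""
    if len(judge_labels) != len(ground_truth) or not ground_truth:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == g for j, g in zip(judge_labels, ground_truth))
    return matches / len(ground_truth)


def position_bias_gap(labels_by_ordering, ground_truth):
    """Agreement per ordering, plus the largest gap in percentage points."""
    rates = {
        name: agreement_rate(labels, ground_truth)
        for name, labels in labels_by_ordering.items()
    }
    gap_points = 100 * (max(rates.values()) - min(rates.values()))
    return rates, gap_points


# Toy data, not from the paper: four items, three orderings.
truth = [1, 0, 1, 1]
rates, gap = position_bias_gap(
    {
        "original": [1, 0, 1, 1],
        "reversed": [1, 0, 0, 1],
        "permuted": [0, 0, 0, 1],
    },
    truth,
)
print(rates, gap)  # gap is 50.0 percentage points on this toy data
```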

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production systems could use the framework to flag compliance drift automatically before external audits are triggered.
  • The judge-panel approach might scale to additional regulatory domains if the uncertainty signal remains predictive outside the five tested criteria.
  • Accounting for position bias in advance could raise effective agreement rates and reduce unnecessary human reviews.
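
As a rough illustration of the first point, automatic drift flagging could compare a rolling mean of recent compliance scores against a qualification-time baseline. Everything here is an assumption of the sketch: the class name `DriftMonitor`, the window size, and the tolerance are not described in the paper.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Hypothetical monitor: flags drift when recent scores fall below a baseline."""

    def __init__(self, baseline, window=20, tolerance=0.1):
        self.baseline = baseline      # assumed score level at qualification time
        self.tolerance = tolerance    # assumed allowed drop before flagging
        self.recent = deque(maxlen=window)

    def observe(self, composite_score):
        """Record a score; return True once a full window has drifted too low."""
        self.recent.append(composite_score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data to judge
        return self.baseline - mean(self.recent) > self.tolerance


monitor = DriftMonitor(baseline=0.8, window=3, tolerance=0.1)
flags = [monitor.observe(s) for s in (0.8, 0.8, 0.8, 0.5, 0.5, 0.5)]
print(flags[-1])  # prints True: the last full window averages 0.5
```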

Load-bearing premise

That inter-judge disagreement among the regulatory LLM evaluators can be reliably reframed as a regulatory uncertainty signal warranting human arbitration and that the observed agreement rates generalize to production drift detection.

What would settle it

A live deployment in which accumulated compliance scores and disagreement rates show no correlation with subsequently detected regulatory violations or emergent behavioral drift.
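
A check of this kind could start from a plain Pearson correlation between per-item disagreement and a 0/1 indicator of a later-detected violation (with a binary variable this is the point-biserial correlation). The sketch below uses toy numbers; the variable names and data are hypothetical, and a value near zero on real deployment data would count against the premise.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data, not from the paper: judge disagreement per item and whether
# a violation was later detected for that item (1) or not (0).
disagreement = [0.05, 0.10, 0.30, 0.40]
violated = [0, 0, 1, 1]
print(pearson(disagreement, violated))
```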

Figures

Figures reproduced from arXiv: 2605.24737 by Jehanne Dussert.

Figure 1: Microservice architecture of govllm. Three independent FastAPI services share a common Pydantic schema library. The gateway publishes observability events to Redis; judge evaluation is triggered by the frontend via a direct call to POST /eval/score on the evaluation service.
Figure 2: Evaluation pipeline. A chat response triggers judge evaluation via a direct call to …
Figure 3: Model lifecycle qualification cycle. A model advances from test to production only after …
Figure 4: Global agreement rate per judge across three question orderings (original …
Figure 5: Mean composite score across the judge × generator matrix (48-prompt benchmark, 4 models). Diagonal cells (self-evaluation, red border) show that no model exhibits positive self-preference from the judge perspective (how each judge rates its own family's outputs vs. other families): all self-evaluation scores are equal to or lower than the corresponding row mean. phi4-mini shows the strongest anti-self bias…
Figure 6: Boxplot of inter-judge composite score standard deviation ( …
Figure 7: Score gap between domain-specific governance profiles and the quality baseline, by …
Figure 8: Generator model parameter count vs. mean composite governance score ( …
Original abstract

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that binary, audit-time compliance assessments are ill-suited to the EU AI Act's demands for ongoing oversight and drift detection. It introduces the principle of 'governance from metrics,' whereby compliance is treated as a continuous runtime signal, and presents the open-source govllm framework implementing a governance-driven routing architecture. In this architecture, model selection is driven by accumulated compliance scores from a panel of specialized regulatory judges (on-premise SLMs per criterion), with inter-judge disagreement reframed as a regulatory uncertainty signal warranting human arbitration rather than noise. The approach is validated on a 49-pair ground-truth corpus across five criteria, with reported agreement rates of 51.5% (mistral:7b) to 69.1% (phi4-mini), no single model dominating, and position bias degrading agreement by up to 25 points under order permutations.

Significance. If the central empirical claims hold under stronger validation, the work could provide a reproducible, on-premise framework for continuous LLM compliance monitoring that aligns with regulatory needs for human oversight. The open-source release and fully on-premise SLM evaluation (1.7B–7B parameters) are explicit strengths that anchor external reproducibility and reduce reliance on proprietary APIs.

major comments (3)
  1. [Abstract] Abstract and presumed validation section: The central claim that inter-judge disagreement constitutes a usable regulatory uncertainty signal warranting human arbitration rests on the 49-pair corpus, yet the manuscript reports only raw agreement percentages (51.5–69.1%) with no error bars, statistical significance tests, or correlation analysis linking disagreement magnitude to actual compliance violations or human arbitration outcomes.
  2. [Abstract] Abstract: No baseline routing experiments are described comparing governance-driven selection (accumulated compliance scores) against standard latency/cost-based routing, nor are there results showing improved compliance or drift detection in simulated production traffic; this is load-bearing for the claim that the jury design and disagreement signal improve upon existing practice.
  3. [Abstract] Abstract: The ground-truth corpus construction is not detailed (selection criteria for the 49 pairs, annotation protocol, or inter-annotator agreement among human labels), making it impossible to assess whether the observed agreement rates and position bias generalize beyond this specific set or support production drift detection.
minor comments (1)
  1. The abstract could more explicitly separate the conceptual contribution (governance from metrics) from the empirical validation results to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the empirical presentation can be strengthened. We respond to each major comment below, proposing targeted revisions to the abstract and validation sections while preserving the paper's scope as a framework introduction with foundational judge-agreement validation.

Point-by-point responses
  1. Referee: [Abstract] Abstract and presumed validation section: The central claim that inter-judge disagreement constitutes a usable regulatory uncertainty signal warranting human arbitration rests on the 49-pair corpus, yet the manuscript reports only raw agreement percentages (51.5–69.1%) with no error bars, statistical significance tests, or correlation analysis linking disagreement magnitude to actual compliance violations or human arbitration outcomes.

    Authors: We agree the presentation of results can be improved. The 49-pair corpus is modest by design to enable fully on-premise replication; we will add bootstrap 95% confidence intervals around each agreement rate and a note on the absence of graded human uncertainty labels that would enable correlation analysis. Formal significance testing across models will be included where sample sizes permit without inflating Type I error. These additions will be made in the revised validation section and referenced from the abstract. revision: yes

  2. Referee: [Abstract] Abstract: No baseline routing experiments are described comparing governance-driven selection (accumulated compliance scores) against standard latency/cost-based routing, nor are there results showing improved compliance or drift detection in simulated production traffic; this is load-bearing for the claim that the jury design and disagreement signal improve upon existing practice.

    Authors: The current validation deliberately isolates the jury mechanism and disagreement signal rather than end-to-end routing performance. We will revise the abstract and discussion to explicitly delimit the contribution to judge reliability and position-bias analysis, removing any implication of demonstrated routing superiority. A dedicated paragraph on planned production-traffic benchmarks will be added. We maintain that the jury design is a necessary prerequisite step before such comparisons. revision: partial

  3. Referee: [Abstract] Abstract: The ground-truth corpus construction is not detailed (selection criteria for the 49 pairs, annotation protocol, or inter-annotator agreement among human labels), making it impossible to assess whether the observed agreement rates and position bias generalize beyond this specific set or support production drift detection.

    Authors: The full manuscript contains a methods subsection on corpus construction (sampling from public regulatory prompt sets, two-expert annotation protocol, and Cohen's κ = 0.78). We will expand the abstract to include these key details (pair selection criteria, annotation protocol summary, and inter-annotator agreement) and add a sentence on limitations for drift detection. This will make the validation scope transparent without altering the reported numbers. revision: yes
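
The two statistics named in these responses, bootstrap confidence intervals for an agreement rate and Cohen's κ for inter-annotator agreement, can be sketched as below. This is a minimal illustration under assumptions: a percentile bootstrap over per-item match flags and the standard two-rater κ. It does not reproduce the authors' computation, and the example data are invented.

```python
import random
from collections import Counter


def bootstrap_agreement_ci(matches, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an agreement rate.

    matches: list of 0/1 flags, one per item (1 = judge agrees with annotation).
    Returns (point_estimate, lower, upper).
    """
    if not matches:
        raise ValueError("matches must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(matches)
    rates = sorted(
        sum(rng.choices(matches, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(matches) / n, lower, upper


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one label")
    return (observed - expected) / (1 - expected)


# Invented example: 49 items, 34 of which match the annotation.
point, low, high = bootstrap_agreement_ci([1] * 34 + [0] * 15)
print(point, low, high)
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))  # prints 0.0 (chance-level agreement)
```

With only 49 items the interval is wide, which is the practical reason the referee asks for it alongside the raw percentages.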

Circularity Check

0 steps flagged

No significant circularity in the empirical framework

Full rationale

The paper introduces 'governance from metrics' as a principle and presents govllm as an empirical framework validated on a 49-pair ground-truth corpus with four on-premise SLMs. No equations, predictions, or first-principles derivations are claimed that reduce compliance scores or uncertainty signals to fitted parameters defined by the same data. The reframing of inter-judge disagreement is presented as a design choice motivated by observed agreement rates (51.5-69.1%) rather than derived from self-citations or self-definitions. The open-source release and on-premise execution supply external reproducibility anchors, rendering the work self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that small language models can serve as regulatory judges and that their disagreement functions as a usable uncertainty signal; no free parameters or invented physical entities are described in the abstract.

axioms (2)
  • domain assumption: Small language models (1.7B-7B) can produce usable compliance evaluations for EU AI Act, GDPR, ANSSI and accessibility criteria when run on-premise.
    Invoked in the validation section of the abstract where agreement rates are reported as evidence for the jury design.
  • ad hoc to paper: Inter-judge disagreement constitutes a regulatory uncertainty signal that warrants human arbitration rather than being treated as noise.
    Explicitly stated as the reframing that motivates the Profile-as-jury design.
invented entities (1)
  • regulatory judges (LLM evaluators specialised per criterion): no independent evidence
    purpose: To generate per-criterion compliance scores whose disagreement serves as an uncertainty signal.
    Introduced as the central mechanism of the govllm architecture; no independent falsifiable prediction (e.g., predicted mass or external benchmark) is supplied in the abstract.


