Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

Jehanne Dussert

arXiv: 2605.24737 · v1 · pith:QDO5JE7Inew · submitted 2026-05-23 · 💻 cs.CL · cs.AI · cs.CY

Pith reviewed 2026-06-30 12:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY

keywords LLM compliance monitoring · runtime governance · regulatory judges · EU AI Act · model routing · inter-judge disagreement · continuous compliance · governance from metrics

The pith

Regulatory compliance for LLMs emerges as a continuous runtime signal from a panel of specialized judges rather than one-time binary audits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges binary audit-based approaches to AI compliance, arguing they cannot support the continuous human oversight required by regulations such as the EU AI Act. It proposes governance from metrics as an alternative principle that turns compliance into an observable, ongoing signal extracted from runtime data. The govllm framework puts this into practice by routing model choices according to accumulated scores from a panel of regulatory judges, each focused on specific criteria. Disagreements within this panel are reframed as useful uncertainty signals that prompt human intervention. Testing on a corpus of 49 annotated prompt-response pairs shows agreement rates ranging from 51.5% to 69.1% across four small language models, and a judge-specific position bias that degrades agreement by up to 25 percentage points.

Core claim

The central claim is that governance from metrics derives regulatory compliance as a continuous signal from runtime observability. The govllm framework implements a governance-driven routing architecture in which model selection is based on accumulated compliance scores from a panel of regulatory judges specialized per criterion. Inter-judge disagreement is reframed as a regulatory uncertainty signal that warrants human arbitration rather than being treated as noise. Validation on 49 annotated prompt-response pairs across five criteria, using four small on-premise models, shows agreement rates from 51.5% to 69.1%, with no model dominating, three documented structural failure modes, and a position bias.
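The routing idea in this claim can be made concrete with a small sketch. It is not govllm's implementation: the class name `ComplianceRouter`, the score scale of 0 to 1, the use of population standard deviation as the disagreement measure, and the threshold value are all assumptions made for illustration. Only the judge names `phi4-mini` and `mistral:7b` come from the paper; `judge-c` and the model names are placeholders.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical threshold: score spread above which a response is escalated.
DISAGREEMENT_THRESHOLD = 0.2


class ComplianceRouter:
    """Accumulates panel scores per generator model and routes to the best one."""

    def __init__(self):
        # generator model name -> list of panel-mean scores, one per evaluation
        self._scores = defaultdict(list)

    def record(self, model, judge_scores):
        """Store one evaluation. Returns True when the judges disagree enough
        that a human should arbitrate (assumed rule: std dev above threshold)."""
        values = list(judge_scores.values())
        self._scores[model].append(mean(values))
        return pstdev(values) > DISAGREEMENT_THRESHOLD

    def select(self):
        """Pick the model with the highest accumulated mean compliance score."""
        return max(self._scores, key=lambda m: mean(self._scores[m]))


router = ComplianceRouter()
# Toy scores, invented for illustration only.
needs_human_a = router.record(
    "model-a", {"phi4-mini": 0.90, "mistral:7b": 0.80, "judge-c": 0.85}
)
needs_human_b = router.record(
    "model-b", {"phi4-mini": 0.90, "mistral:7b": 0.20, "judge-c": 0.70}
)
chosen = router.select()
```

With these toy inputs, `chosen` is `model-a` and only the second evaluation is flagged for arbitration. Whether a standard-deviation threshold is the right escalation rule is exactly the open question the review raises.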

What carries the argument

The panel of regulatory judges, LLM evaluators specialized per criterion such as EU AI Act, GDPR, ANSSI and accessibility, whose accumulated compliance scores determine routing and whose disagreements are reframed as actionable uncertainty signals.

If this is right

Model selection in deployed systems is driven by accumulated compliance scores rather than by latency or cost alone.
Inter-judge disagreement becomes an explicit signal that triggers human arbitration.
Three structural failure modes in small regulatory judges are identified and can be mitigated in the design.
Position bias in judges is quantified and can be tested across question-order conditions to improve reliability.
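A position-bias test of the kind described in the last item could look like the following sketch. It assumes binary verdicts compared against ground-truth labels and measures bias as the gap between the best and worst agreement rate across orderings; the paper may define both differently. The function names and all data are invented for illustration.

```python
def agreement_rate(verdicts, truth):
    """Percentage of judge verdicts that match the ground-truth labels."""
    if len(verdicts) != len(truth):
        raise ValueError("verdicts and truth must have the same length")
    matches = sum(v == t for v, t in zip(verdicts, truth))
    return 100.0 * matches / len(truth)


def position_bias_gap(verdicts_by_order, truth):
    """Agreement per ordering, plus the max-min gap in percentage points."""
    rates = {
        name: agreement_rate(verdicts, truth)
        for name, verdicts in verdicts_by_order.items()
    }
    return rates, max(rates.values()) - min(rates.values())


# Toy labels, invented for illustration only (not the paper's data).
truth = [1, 0, 1, 1, 0, 1, 0, 0]
verdicts_by_order = {
    "original": [1, 0, 1, 1, 0, 1, 0, 1],  # 7 of 8 match
    "reversed": [1, 0, 0, 1, 0, 1, 1, 1],  # 5 of 8 match
    "permuted": [1, 1, 1, 1, 0, 1, 0, 1],  # 6 of 8 match
}
rates, gap = position_bias_gap(verdicts_by_order, truth)
```

On this toy input the gap is 25.0 percentage points between the original and reversed orderings. Running the same judge under all three orderings before deployment would show which judges need order randomization.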

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems could use the framework to flag compliance drift automatically before external audits are triggered.
The judge-panel approach might scale to additional regulatory domains if the uncertainty signal remains predictive outside the five tested criteria.
Accounting for position bias in advance could raise effective agreement rates and reduce unnecessary human reviews.

Load-bearing premise

That inter-judge disagreement among the regulatory LLM evaluators can be reliably reframed as a regulatory uncertainty signal warranting human arbitration and that the observed agreement rates generalize to production drift detection.

What would settle it

A live deployment in which accumulated compliance scores and disagreement rates show no correlation with subsequently detected regulatory violations or emergent behavioral drift.

Figures

Figures reproduced from arXiv: 2605.24737 by Jehanne Dussert.

**Figure 1.** Microservice architecture of govllm. Three independent FastAPI services share a common Pydantic schema library. The gateway publishes observability events to Redis; judge evaluation is triggered by the frontend via a direct call to POST /eval/score on the evaluation service.

**Figure 2.** Evaluation pipeline. A chat response triggers judge evaluation via a direct call to …

**Figure 3.** Model lifecycle qualification cycle. A model advances from test to production only after …

**Figure 4.** Global agreement rate per judge across three question orderings (original …

**Figure 5.** Mean composite score across the judge × generator matrix (48-prompt benchmark, 4 models). Diagonal cells (self-evaluation, red border) show that no model exhibits positive self-preference from the judge perspective (how each judge rates its own family’s outputs vs. other families): all self-evaluation scores are equal to or lower than the corresponding row mean. phi4-mini shows the strongest anti-self bias…

**Figure 6.** Boxplot of inter-judge composite score standard deviation ( …

**Figure 7.** Score gap between domain-specific governance profiles and the quality baseline, by …

**Figure 8.** Generator model parameter count vs. mean composite governance score ( …

Original abstract

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The govllm framework and open-source release are concrete steps, but the 49-pair corpus with 51-69% agreement and position bias does not validate disagreement as a usable uncertainty signal or show routing gains.

The paper's main move is to treat compliance as a running score from a panel of small on-premise judges, each tied to one regulatory criterion, then route models by those accumulated scores. They release the code and report agreement rates plus position bias on a 49-pair set. That release and the bias documentation are the parts that actually land.

The multi-judge setup and the observation that no single SLM wins across criteria are reasonable engineering choices. Documenting how question order shifts agreement by up to 25 points is also straightforward and useful to see in print.

The central claim does not hold up yet. The agreement numbers sit between 51% and 69%, which already signals substantial noise. Nothing in the reported results shows that disagreement tracks actual compliance failures, that the scores improve routing over latency or cost baselines, or that the method would catch drift in live traffic. The corpus size and lack of construction details or error bars leave the reframing of disagreement as a regulatory signal as an untested modeling decision rather than a demonstrated mechanism.

This is for people building production monitoring layers for regulated AI, especially those needing on-premise options. A reader looking for a working prototype with honest bias notes could pull the code and extend it. The work is not ready to stand on its own as evidence for the EU AI Act's continuous oversight needs.

It deserves peer review. The implementation is real and the empirical observations are specific enough that referees can ask for the missing baselines and larger tests without starting from zero.

Referee Report

3 major / 1 minor

Summary. The paper claims that binary, audit-time compliance assessments are ill-suited to the EU AI Act's demands for ongoing oversight and drift detection. It introduces the principle of 'governance from metrics,' whereby compliance is treated as a continuous runtime signal, and presents the open-source govllm framework implementing a governance-driven routing architecture. In this architecture, model selection is driven by accumulated compliance scores from a panel of specialized regulatory judges (on-premise SLMs per criterion), with inter-judge disagreement reframed as a regulatory uncertainty signal warranting human arbitration rather than noise. The approach is validated on a 49-pair ground-truth corpus across five criteria, with reported agreement rates of 51.5% (mistral:7b) to 69.1% (phi4-mini), no single model dominating, and position bias degrading agreement by up to 25 points under order permutations.

Significance. If the central empirical claims hold under stronger validation, the work could provide a reproducible, on-premise framework for continuous LLM compliance monitoring that aligns with regulatory needs for human oversight. The open-source release and fully on-premise SLM evaluation (1.7B–7B parameters) are explicit strengths that anchor external reproducibility and reduce reliance on proprietary APIs.

major comments (3)

[Abstract] Abstract and presumed validation section: The central claim that inter-judge disagreement constitutes a usable regulatory uncertainty signal warranting human arbitration rests on the 49-pair corpus, yet the manuscript reports only raw agreement percentages (51.5–69.1%) with no error bars, statistical significance tests, or correlation analysis linking disagreement magnitude to actual compliance violations or human arbitration outcomes.
[Abstract] Abstract: No baseline routing experiments are described comparing governance-driven selection (accumulated compliance scores) against standard latency/cost-based routing, nor are there results showing improved compliance or drift detection in simulated production traffic; this is load-bearing for the claim that the jury design and disagreement signal improve upon existing practice.
[Abstract] Abstract: The ground-truth corpus construction is not detailed (selection criteria for the 49 pairs, annotation protocol, or inter-annotator agreement among human labels), making it impossible to assess whether the observed agreement rates and position bias generalize beyond this specific set or support production drift detection.
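The error bars requested in the first comment could be produced with a percentile bootstrap over per-item match indicators. This is a sketch under assumptions, not the authors' procedure: the function name, resample count, and seed are arbitrary, and the match vector is a toy with 34 matches out of 49 items.

```python
import random


def bootstrap_ci(matches, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a 0/1 vector."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(matches)
    # Resample items with replacement and record each resample's agreement rate.
    means = sorted(
        sum(rng.choices(matches, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy vector, invented for illustration: 1 = judge matched the ground truth.
matches = [1] * 34 + [0] * 15
point = sum(matches) / len(matches)
low, high = bootstrap_ci(matches)
```

With only 49 items the interval is wide, which is the referee's point: a spread of this size makes differences between judges hard to separate from sampling noise.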

minor comments (1)

The abstract could more explicitly separate the conceptual contribution (governance from metrics) from the empirical validation results to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the empirical presentation can be strengthened. We respond to each major comment below, proposing targeted revisions to the abstract and validation sections while preserving the paper's scope as a framework introduction with foundational judge-agreement validation.

Referee: [Abstract] Abstract and presumed validation section: The central claim that inter-judge disagreement constitutes a usable regulatory uncertainty signal warranting human arbitration rests on the 49-pair corpus, yet the manuscript reports only raw agreement percentages (51.5–69.1%) with no error bars, statistical significance tests, or correlation analysis linking disagreement magnitude to actual compliance violations or human arbitration outcomes.

Authors: We agree the presentation of results can be improved. The 49-pair corpus is modest by design to enable fully on-premise replication; we will add bootstrap 95% confidence intervals around each agreement rate and a note on the absence of graded human uncertainty labels that would enable correlation analysis. Formal significance testing across models will be included where sample sizes permit without inflating Type I error. These additions will be made in the revised validation section and referenced from the abstract.

revision: yes

Referee: [Abstract] Abstract: No baseline routing experiments are described comparing governance-driven selection (accumulated compliance scores) against standard latency/cost-based routing, nor are there results showing improved compliance or drift detection in simulated production traffic; this is load-bearing for the claim that the jury design and disagreement signal improve upon existing practice.

Authors: The current validation deliberately isolates the jury mechanism and disagreement signal rather than end-to-end routing performance. We will revise the abstract and discussion to explicitly delimit the contribution to judge reliability and position-bias analysis, removing any implication of demonstrated routing superiority. A dedicated paragraph on planned production-traffic benchmarks will be added. We maintain that the jury design is a necessary prerequisite step before such comparisons.

revision: partial

Referee: [Abstract] Abstract: The ground-truth corpus construction is not detailed (selection criteria for the 49 pairs, annotation protocol, or inter-annotator agreement among human labels), making it impossible to assess whether the observed agreement rates and position bias generalize beyond this specific set or support production drift detection.

Authors: The full manuscript contains a methods subsection on corpus construction (sampling from public regulatory prompt sets, two-expert annotation protocol, and Cohen's κ = 0.78). We will expand the abstract to include these key details (pair selection criteria, annotation protocol summary, and inter-annotator agreement) and add a sentence on limitations for drift detection. This will make the validation scope transparent without altering the reported numbers.

revision: yes
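For readers unfamiliar with the inter-annotator statistic cited in the last response, Cohen's κ compares observed agreement between two annotators with the agreement expected by chance. The sketch below is the standard two-rater formula; the function name and the label vectors are invented for illustration and are not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration only.
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on 8 of 10 items and chance agreement is 0.5, so κ comes out at about 0.6.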

Circularity Check

0 steps flagged

No significant circularity in the empirical framework

The paper introduces 'governance from metrics' as a principle and presents govllm as an empirical framework validated on a 49-pair ground-truth corpus with four on-premise SLMs. No equations, predictions, or first-principles derivations are claimed that reduce compliance scores or uncertainty signals to fitted parameters defined by the same data. The reframing of inter-judge disagreement is presented as a design choice motivated by observed agreement rates (51.5-69.1%) rather than derived from self-citations or self-definitions. The open-source release and on-premise execution supply external reproducibility anchors, rendering the work self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that small language models can serve as regulatory judges and that their disagreement functions as a usable uncertainty signal; no free parameters or invented physical entities are described in the abstract.

axioms (2)

domain assumption: Small language models (1.7B-7B) can produce usable compliance evaluations for EU AI Act, GDPR, ANSSI and accessibility criteria when run on-premise.
Invoked in the validation section of the abstract where agreement rates are reported as evidence for the jury design.

ad hoc to paper: Inter-judge disagreement constitutes a regulatory uncertainty signal that warrants human arbitration rather than being treated as noise.
Explicitly stated as the reframing that motivates the Profile-as-jury design.

invented entities (1)

regulatory judges (LLM evaluators specialised per criterion) · no independent evidence
purpose: To generate per-criterion compliance scores whose disagreement serves as an uncertainty signal.
Introduced as the central mechanism of the govllm architecture; no independent falsifiable prediction (e.g., predicted mass or external benchmark) is supplied in the abstract.
