Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3
The pith
Rule-governed AI decisions should be evaluated by logical consistency with explicit policies rather than agreement with past human labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In rule-governed environments, evaluation should shift from agreement with historical labels to reasoning-grounded validity under explicit rules. An audit model verifies whether each proposed decision is logically derivable from the governing rule hierarchy, using its own reasoning traces and token logprobs, and the Defensibility Index and Ambiguity Index quantify that validity and ambiguity. On 193,000+ Reddit cases this yields 33-46.6 percentage-point gaps between agreement-based and policy-grounded metrics, with 79.8-80.6% of false negatives actually policy-consistent, and enables a Governance Gate with 78.6% automation coverage.
What carries the argument
The Defensibility Index (DI), which scores whether a proposed decision is logically derivable from the governing rule hierarchy, treating the audit model's reasoning traces and token logprobs as a governance signal rather than a classification output.
If this is right
- Agreement metrics produce 33-46.6 percentage-point gaps from policy-grounded metrics on real moderation data.
- 79.8-80.6% of cases the model flags as errors under agreement are actually consistent with the rules.
- Making rules more specific reduces the Ambiguity Index by 10.8 percentage points while the Defensibility Index stays stable.
- A Governance Gate using the signals achieves 78.6% automation coverage with 64.9% risk reduction (a minimal gating sketch follows this list).
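To make the gate concrete, the following is a minimal, hypothetical sketch in Python of a threshold-based Governance Gate. The abstract does not specify how the gate is constructed; the field names, thresholds, and routing labels below are illustrative assumptions, not the paper's design.

# Hypothetical threshold-based Governance Gate over the three signals.
# Thresholds and routing labels are illustrative assumptions; the paper
# reports only the gate's aggregate coverage and risk reduction.
from dataclasses import dataclass

@dataclass
class AuditedCase:
    case_id: str
    di: float   # Defensibility Index: derivability of the decision from the rules
    ai: float   # Ambiguity Index: degree to which the rules admit multiple outcomes
    pds: float  # Probabilistic Defensibility Signal: logprob-based stability

def governance_gate(case: AuditedCase,
                    di_min: float = 0.9,
                    ai_max: float = 0.2,
                    pds_min: float = 0.8) -> str:
    """Route a case to automation, policy review, or human review."""
    if case.di >= di_min and case.ai <= ai_max and case.pds >= pds_min:
        return "automate"         # defensible, unambiguous, and stable
    if case.ai > ai_max:
        return "flag_policy_gap"  # rules admit multiple outcomes: ambiguity, not model error
    return "human_review"         # defensibility or stability too low to automate

print(governance_gate(AuditedCase("case_001", di=0.95, ai=0.05, pds=0.91)))  # automate

On this reading, automation coverage would be the fraction of cases routed to "automate", with risk reduction measured on the cases held back for review.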
Where Pith is reading between the lines
- The same signals could support evaluation in other domains that rely on explicit rule hierarchies, such as regulatory compliance or legal reasoning tasks.
- Using internal logprobs for stability estimation may reduce the need for repeated human labeling when rules change.
- Combining the Probabilistic Defensibility Signal with external uncertainty measures could further separate model noise from genuine policy gaps.
Load-bearing premise
An audit LLM can reliably check logical derivability from rules by inspecting its own reasoning traces and token probabilities without introducing new systematic biases.
What would settle it
A side-by-side expert review of decisions the index rates as defensible, checked against the clear, written rules, would falsify the claim if a large share of those decisions were judged invalid.
Original abstract
Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error -- a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies the 'Agreement Trap' in evaluating rule-governed AI systems, where agreement with historical labels fails because multiple decisions can be consistent with policy. It introduces the Defensibility Index (DI), the Ambiguity Index (AI), and the Probabilistic Defensibility Signal (PDS), the last derived from audit-LLM token logprobs, to assess whether decisions are logically derivable from rule hierarchies. On 193,000+ Reddit moderation decisions, it reports 33-46.6 pp gaps between agreement and policy-grounded metrics, 79.8-80.6% of false negatives being policy-grounded, rule-specificity effects on AI, and a Governance Gate with 78.6% automation coverage and 64.9% risk reduction, advocating a shift to reasoning-grounded evaluation.
Significance. If validated, this framework could transform evaluation practices in AI governance and content moderation by providing metrics that account for rule ambiguity and logical consistency rather than strict label agreement. The use of existing reasoning traces for PDS without extra passes is efficient, and the large-scale empirical results on real-world data strengthen the case for policy-grounded approaches. The Governance Gate demonstrates practical applicability with substantial risk reduction.
major comments (3)
- Abstract: The reported 79.8-80.6% of false negatives being policy-grounded is determined by the audit LLM's assessment of logical derivability from its own reasoning traces and logprobs. This risks circularity, as there is no mention of independent human adjudication or a separate verifier to confirm the 'policy-grounded' classification, potentially inflating the gap if the model overstates derivability.
- Abstract: No equations or algorithmic details are provided for computing the Defensibility Index (DI), Ambiguity Index (AI), or Probabilistic Defensibility Signal (PDS) from token logprobs. Without these, it is unclear whether PDS is truly derived independently or if it reduces to parameters fitted to the correctness signal, undermining claims of it being a stable governance signal.
- Abstract: The rule-specificity experiment shows that AI is reduced by 10.8 pp under more specific rules while DI remains stable, but it lacks baseline comparisons to standard agreement metrics and statistical error bars on the changes, making it difficult to assess whether this supports the superiority of the new metrics over existing ones.
minor comments (2)
- Abstract: Details on how the 193,000+ decisions and cohorts were constructed, including any filtering or post-processing, are missing and should be provided for reproducibility.
- Abstract: The abstract mentions 'repeated-sampling analysis' attributing PDS variance to governance ambiguity, but without specifics on the sampling method or variance decomposition, this claim is hard to evaluate.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
-
Referee: Abstract: The reported 79.8-80.6% of false negatives being policy-grounded is determined by the audit LLM's assessment of logical derivability from its own reasoning traces and logprobs. This risks circularity, as there is no mention of independent human adjudication or a separate verifier to confirm the 'policy-grounded' classification, potentially inflating the gap if the model overstates derivability.
Authors: The policy-grounded classification is intentionally performed by the audit LLM as a verifier of logical derivability from the explicit rule hierarchy, separate from the original moderation model's decisions and historical labels. This is the central methodological shift proposed in the paper: using reasoning traces to assess consistency rather than label agreement. The 33-46.6 pp gap is measured between agreement metrics and this audit-derived metric, so the classification is not circular but definitional to the new evaluation approach. We acknowledge that independent human adjudication would provide further corroboration and will revise the manuscript to discuss this explicitly as a limitation, while emphasizing the scale of the Reddit dataset (193k+ cases) as empirical grounding. Revision: partial.
-
Referee: Abstract: No equations or algorithmic details are provided for computing the Defensibility Index (DI), Ambiguity Index (AI), or Probabilistic Defensibility Signal (PDS) from token logprobs. Without these, it is unclear whether PDS is truly derived independently or if it reduces to parameters fitted to the correctness signal, undermining claims of it being a stable governance signal.
Authors: The full manuscript defines these in the Methods section: DI as the proportion of audit reasoning steps that affirm logical derivability from the rule hierarchy; AI as normalized entropy across possible derivable outcomes; and PDS as the geometric mean of logprobs on the tokens expressing the derivability conclusion, extracted from a single audit pass without reference to correctness labels (a minimal illustrative sketch of these definitions follows the point-by-point responses). PDS is computed solely from the audit model's token probabilities and is not fitted or tuned to any external correctness signal. To address the concern directly, we will incorporate the key equations and a brief algorithmic description into the manuscript and strengthen the independence claim in the text. Revision: yes.
-
Referee: Abstract: The rule-specificity experiment shows AI reduced by 10.8 pp with more specific rules while DI stable, but lacks baseline comparisons to standard agreement metrics or statistical error bars on the changes, making it difficult to assess if this supports the superiority of the new metrics over existing ones.
Authors: The primary empirical contribution already demonstrates superiority via the 33-46.6 pp gaps between agreement-based and policy-grounded metrics on the full 193k+ dataset. The rule-specificity experiment (37,286 decisions) isolates the effect of rule granularity on ambiguity while holding decisions fixed. We will revise this section to add a direct baseline comparison showing that agreement metrics exhibit minimal change across the three rule tiers, and to include statistical error bars (e.g., 95% confidence intervals) on the 10.8 pp AI reduction. These additions will clarify the differential responsiveness of the proposed metrics. Revision: yes.
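Taking the definitions reported in the second response at face value (DI as the fraction of affirming audit reasoning steps, AI as normalized entropy over the derivable outcomes, PDS as the geometric mean of conclusion-token probabilities), a minimal Python sketch of the three signals could look as follows. The function names and inputs are assumptions for illustration, not the paper's implementation.

# Illustrative sketch of DI, AI, and PDS as described in the rebuttal above.
# Inputs (per-step affirmations, outcome probabilities, conclusion-token
# logprobs) are assumed to come from a single audit pass.
import math
from typing import Sequence

def defensibility_index(step_affirms: Sequence[bool]) -> float:
    """Fraction of audit reasoning steps that affirm logical derivability."""
    return sum(step_affirms) / len(step_affirms)

def ambiguity_index(outcome_probs: Sequence[float]) -> float:
    """Entropy over the outcomes the rules admit, normalized to [0, 1]."""
    probs = [p for p in outcome_probs if p > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

def probabilistic_defensibility_signal(conclusion_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities on the derivability conclusion."""
    return math.exp(sum(conclusion_logprobs) / len(conclusion_logprobs))

# Example: 4 of 5 reasoning steps affirm derivability, the rules admit two
# outcomes with a 70/30 split, and the conclusion tokens have high logprobs.
print(defensibility_index([True, True, True, True, False]))       # 0.8
print(ambiguity_index([0.7, 0.3]))                                # ~0.88
print(probabilistic_defensibility_signal([-0.05, -0.10, -0.02]))  # ~0.94

In this sketch AI is 0 when the rules admit a single outcome and approaches 1 when several outcomes are equally derivable, which matches the rebuttal's description of normalized entropy.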
Circularity Check
No circularity: metrics derived from rules and logprobs, validated against external labels
Full rationale
The paper defines DI and AI as measures of policy-grounded correctness via logical derivability from explicit rule hierarchies, and derives PDS directly from audit-model token logprobs, without fitting to human labels or a target correctness signal. The empirical gaps (33-46.6 pp) and percentages (79.8-80.6% of false negatives as policy-grounded) are computed by applying these signals to an external dataset of 193k human-labeled decisions; the audit verification step is an independent methodological choice rather than a self-referential reduction. No equation, self-citation, or ansatz reduces the central claims to their inputs by construction. The framework is validated against the provided human-label benchmark rather than against its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM token logprobs can serve as a proxy for reasoning stability without additional audit passes
- domain assumption: Multiple decisions can be logically consistent with a given policy hierarchy
invented entities (4)
-
Defensibility Index (DI)
no independent evidence
-
Ambiguity Index (AI)
no independent evidence
-
Probabilistic Defensibility Signal (PDS)
no independent evidence
-
Governance Gate
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Truth is a lie: Crowd truth and the seven myths of human annotation
Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15–24, 2015. doi:10.1609/aimag.v36i1.2564
-
[2]
The internet’s hidden rules: An empirical study of Reddit norm violations at micro, meso, and macro scales
Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. The internet’s hidden rules: An empirical study of Reddit norm violations at micro, meso, and macro scales. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):Article 32, 2018. doi:10.1145/3274301
-
[3]
Large scale crowdsourcing and characterization of Twitter abusive behavior
Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12, pages 491–500, 2018
-
[4]
Dropout as a Bayesian approximation: Representing model uncertainty in deep learning
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1050–1059, 2016
-
[5]
Selective classification for deep neural networks
Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017
-
[6]
Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media
Tarleton Gillespie. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018
-
[7]
Algorithmic content moderation: Technical and political challenges in the automation of platform governance
Robert Gorwa, Reuben Binns, and Christian Katzenbach. Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society, 7(1), 2020. doi:10.1177/2053951719897945
-
[8]
On calibration of modern neural networks
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017
-
[9]
The Concept of Law
Herbert Lionel Adolphus Hart. The Concept of Law. Oxford University Press, 1961
-
[10]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022
-
[11]
Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In Proceedings of ICLR 2023, 2023
-
[12]
Simple and scalable predictive uncertainty estimation using deep ensembles
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017
-
[13]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023
-
[14]
Examining reasoning LLMs-as-judges in non-verifiable LLM post-training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, and Zhengxing Chen. Examining reasoning LLMs-as-judges in non-verifiable LLM post-training. arXiv preprint arXiv:2603.12246, 2026
-
[15]
Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations
Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10:92–110, 2022. doi:10.1162/tacl_a_00449
-
[16]
Obtaining well calibrated probabilities using Bayesian binning
Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015
-
[17]
Inherent disagreements in human textual inferences
Ellie Pavlick and Tom Kwiatkowski. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694, 2019. doi:10.1162/tacl_a_00293
-
[18]
The “problem” of human label variation: On ground truth in data, modeling and evaluation
Barbara Plank. The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, 2022
-
[19]
The risk of racial bias in hate speech detection
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, 2019
-
[20]
Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems, volume 36, 2023
-
[21]
Learning from disagreement: A survey
Alexandra Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, and Barbara Plank. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470, 2021
-
[22]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023
discussion (0)