Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3
The pith
Rule-governed AI decisions should be evaluated by logical consistency with explicit policies rather than agreement with past human labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In rule-governed environments, evaluation should shift from agreement with historical labels to reasoning-grounded validity under explicit rules. An audit model verifies whether each proposed decision is logically derivable from the governing rule hierarchy, using its own reasoning traces and token logprobs, and the Defensibility Index and Ambiguity Index quantify that validity and ambiguity. On 193,000+ Reddit cases this yields 33-46.6 percentage-point gaps between agreement-based and policy-grounded metrics, with 79.8-80.6% of false negatives actually policy-consistent, and enables a Governance Gate with 78.6% automation coverage.
What carries the argument
The Defensibility Index (DI), which scores whether a proposed decision is logically derivable from the governing rule hierarchy, treating the audit model's reasoning traces and token logprobs as a governance signal rather than a classification output.
If this is right
- Agreement metrics produce 33-46.6 percentage-point gaps from policy-grounded metrics on real moderation data.
- 79.8-80.6% of cases the model flags as errors under agreement are actually consistent with the rules.
- Making rules more specific reduces the Ambiguity Index by 10.8 percentage points while the Defensibility Index stays stable.
- A Governance Gate using the signals achieves 78.6% automation coverage with 64.9% risk reduction (a minimal gating sketch follows this list).
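To make the gate concrete, the following is a minimal, hypothetical sketch in Python of a threshold-based Governance Gate. The abstract does not specify how the gate is constructed; the field names, thresholds, and routing labels below are illustrative assumptions, not the paper's design.

# Hypothetical threshold-based Governance Gate over the three signals.
# Thresholds and routing labels are illustrative assumptions; the paper
# reports only the gate's aggregate coverage and risk reduction.
from dataclasses import dataclass

@dataclass
class AuditedCase:
    case_id: str
    di: float   # Defensibility Index: derivability of the decision from the rules
    ai: float   # Ambiguity Index: degree to which the rules admit multiple outcomes
    pds: float  # Probabilistic Defensibility Signal: logprob-based stability

def governance_gate(case: AuditedCase,
                    di_min: float = 0.9,
                    ai_max: float = 0.2,
                    pds_min: float = 0.8) -> str:
    """Route a case to automation, policy review, or human review."""
    if case.di >= di_min and case.ai <= ai_max and case.pds >= pds_min:
        return "automate"         # defensible, unambiguous, and stable
    if case.ai > ai_max:
        return "flag_policy_gap"  # rules admit multiple outcomes: ambiguity, not model error
    return "human_review"         # defensibility or stability too low to automate

print(governance_gate(AuditedCase("case_001", di=0.95, ai=0.05, pds=0.91)))  # automate

On this reading, automation coverage would be the fraction of cases routed to "automate", with risk reduction measured on the cases held back for review.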
Where Pith is reading between the lines
- The same signals could support evaluation in other domains that rely on explicit rule hierarchies, such as regulatory compliance or legal reasoning tasks.
- Using internal logprobs for stability estimation may reduce the need for repeated human labeling when rules change.
- Combining the Probabilistic Defensibility Signal with external uncertainty measures could further separate model noise from genuine policy gaps.
Load-bearing premise
An audit LLM can reliably check logical derivability from rules by inspecting its own reasoning traces and token probabilities without introducing new systematic biases.
What would settle it
A side-by-side expert review of decisions the index rates as defensible, checked against the clear, written rules, would falsify the claim if a large share of those decisions were judged invalid.
Original abstract
Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error -- a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies the 'Agreement Trap' in evaluating rule-governed AI systems, where agreement with historical labels fails because multiple decisions can be consistent with policy. It introduces the Defensibility Index (DI), the Ambiguity Index (AI), and the Probabilistic Defensibility Signal (PDS), the last derived from audit-LLM token logprobs, to assess whether decisions are logically derivable from rule hierarchies. On 193,000+ Reddit moderation decisions, it reports 33-46.6 pp gaps between agreement and policy-grounded metrics, 79.8-80.6% of false negatives being policy-grounded, rule-specificity effects on AI, and a Governance Gate with 78.6% automation coverage and 64.9% risk reduction, advocating a shift to reasoning-grounded evaluation.
Significance. If validated, this framework could transform evaluation practices in AI governance and content moderation by providing metrics that account for rule ambiguity and logical consistency rather than strict label agreement. The use of existing reasoning traces for PDS without extra passes is efficient, and the large-scale empirical results on real-world data strengthen the case for policy-grounded approaches. The Governance Gate demonstrates practical applicability with substantial risk reduction.
major comments (3)
- Abstract: The reported 79.8-80.6% of false negatives being policy-grounded is determined by the audit LLM's assessment of logical derivability from its own reasoning traces and logprobs. This risks circularity, as there is no mention of independent human adjudication or a separate verifier to confirm the 'policy-grounded' classification, potentially inflating the gap if the model overstates derivability.
- Abstract: No equations or algorithmic details are provided for computing the Defensibility Index (DI), Ambiguity Index (AI), or Probabilistic Defensibility Signal (PDS) from token logprobs. Without these, it is unclear whether PDS is truly derived independently or if it reduces to parameters fitted to the correctness signal, undermining claims of it being a stable governance signal.
- Abstract: The rule-specificity experiment shows that AI is reduced by 10.8 pp under more specific rules while DI remains stable, but it lacks baseline comparisons to standard agreement metrics and statistical error bars on the changes, making it difficult to assess whether this supports the superiority of the new metrics over existing ones.
minor comments (2)
- Abstract: Details on how the 193,000+ decisions and cohorts were constructed, including any filtering or post-processing, are missing and should be provided for reproducibility.
- Abstract: The abstract mentions 'repeated-sampling analysis' attributing PDS variance to governance ambiguity, but without specifics on the sampling method or variance decomposition, this claim is hard to evaluate.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
-
Referee: Abstract: The reported 79.8-80.6% of false negatives being policy-grounded is determined by the audit LLM's assessment of logical derivability from its own reasoning traces and logprobs. This risks circularity, as there is no mention of independent human adjudication or a separate verifier to confirm the 'policy-grounded' classification, potentially inflating the gap if the model overstates derivability.
Authors: The policy-grounded classification is intentionally performed by the audit LLM as a verifier of logical derivability from the explicit rule hierarchy, separate from the original moderation model's decisions and historical labels. This is the central methodological shift proposed in the paper: using reasoning traces to assess consistency rather than label agreement. The 33-46.6 pp gap is measured between agreement metrics and this audit-derived metric, so the classification is not circular but definitional to the new evaluation approach. We acknowledge that independent human adjudication would provide further corroboration and will revise the manuscript to discuss this explicitly as a limitation, while emphasizing the scale of the Reddit dataset (193k+ cases) as empirical grounding. Revision: partial.
-
Referee: Abstract: No equations or algorithmic details are provided for computing the Defensibility Index (DI), Ambiguity Index (AI), or Probabilistic Defensibility Signal (PDS) from token logprobs. Without these, it is unclear whether PDS is truly derived independently or if it reduces to parameters fitted to the correctness signal, undermining claims of it being a stable governance signal.
Authors: The full manuscript defines these in the Methods section: DI as the proportion of audit reasoning steps that affirm logical derivability from the rule hierarchy; AI as normalized entropy across possible derivable outcomes; and PDS as the geometric mean of logprobs on the tokens expressing the derivability conclusion, extracted from a single audit pass without reference to correctness labels (a minimal illustrative sketch of these definitions follows the point-by-point responses). PDS is computed solely from the audit model's token probabilities and is not fitted or tuned to any external correctness signal. To address the concern directly, we will incorporate the key equations and a brief algorithmic description into the manuscript and strengthen the independence claim in the text. Revision: yes.
-
Referee: Abstract: The rule-specificity experiment shows AI reduced by 10.8 pp with more specific rules while DI stable, but lacks baseline comparisons to standard agreement metrics or statistical error bars on the changes, making it difficult to assess if this supports the superiority of the new metrics over existing ones.
Authors: The primary empirical contribution already demonstrates superiority via the 33-46.6 pp gaps between agreement-based and policy-grounded metrics on the full 193k+ dataset. The rule-specificity experiment (37,286 decisions) isolates the effect of rule granularity on ambiguity while holding decisions fixed. We will revise this section to add a direct baseline comparison showing that agreement metrics exhibit minimal change across the three rule tiers, and to include statistical error bars (e.g., 95% confidence intervals) on the 10.8 pp AI reduction. These additions will clarify the differential responsiveness of the proposed metrics. Revision: yes.
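Taking the definitions reported in the second response at face value (DI as the fraction of affirming audit reasoning steps, AI as normalized entropy over the derivable outcomes, PDS as the geometric mean of conclusion-token probabilities), a minimal Python sketch of the three signals could look as follows. The function names and inputs are assumptions for illustration, not the paper's implementation.

# Illustrative sketch of DI, AI, and PDS as described in the rebuttal above.
# Inputs (per-step affirmations, outcome probabilities, conclusion-token
# logprobs) are assumed to come from a single audit pass.
import math
from typing import Sequence

def defensibility_index(step_affirms: Sequence[bool]) -> float:
    """Fraction of audit reasoning steps that affirm logical derivability."""
    return sum(step_affirms) / len(step_affirms)

def ambiguity_index(outcome_probs: Sequence[float]) -> float:
    """Entropy over the outcomes the rules admit, normalized to [0, 1]."""
    probs = [p for p in outcome_probs if p > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

def probabilistic_defensibility_signal(conclusion_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities on the derivability conclusion."""
    return math.exp(sum(conclusion_logprobs) / len(conclusion_logprobs))

# Example: 4 of 5 reasoning steps affirm derivability, the rules admit two
# outcomes with a 70/30 split, and the conclusion tokens have high logprobs.
print(defensibility_index([True, True, True, True, False]))       # 0.8
print(ambiguity_index([0.7, 0.3]))                                # ~0.88
print(probabilistic_defensibility_signal([-0.05, -0.10, -0.02]))  # ~0.94

In this sketch AI is 0 when the rules admit a single outcome and approaches 1 when several outcomes are equally derivable, which matches the rebuttal's description of normalized entropy.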
Circularity Check
No circularity: metrics derived from rules and logprobs, validated against external labels
Full rationale
The paper defines DI and AI as measures of policy-grounded correctness via logical derivability from explicit rule hierarchies, and derives PDS directly from audit-model token logprobs, without fitting to human labels or a target correctness signal. The empirical gaps (33-46.6 pp) and percentages (79.8-80.6% of false negatives as policy-grounded) are computed by applying these signals to an external dataset of 193k human-labeled decisions; the audit verification step is an independent methodological choice rather than a self-referential reduction. No equation, self-citation, or ansatz reduces the central claims to their inputs by construction. The framework is validated against the provided human-label benchmark rather than against its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM token logprobs can serve as a proxy for reasoning stability without additional audit passes
- domain assumption: Multiple decisions can be logically consistent with a given policy hierarchy
invented entities (4)
-
Defensibility Index (DI)
no independent evidence
-
Ambiguity Index (AI)
no independent evidence
-
Probabilistic Defensibility Signal (PDS)
no independent evidence
-
Governance Gate
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Truth is a lie: Crowd truth and the seven myths of human annotation
Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15–24, 2015. doi:10.1609/aimag.v36i1.2564
-
[2]
The internet’s hidden rules: An empirical study of Reddit norm violations at micro, meso, and macro scales
Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. The internet’s hidden rules: An empirical study of Reddit norm violations at micro, meso, and macro scales. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):Article 32, 2018. doi:10.1145/3274301
-
[3]
Large scale crowdsourcing and characterization of Twitter abusive behavior
Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12, pages 491–500, 2018
-
[4]
Dropout as a Bayesian approximation: Representing model uncertainty in deep learning
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1050–1059, 2016
-
[5]
Selective classification for deep neural networks
Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017
-
[6]
Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media
Tarleton Gillespie. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018
-
[7]
Algorithmic content moderation: Technical and political challenges in the automation of platform governance
Robert Gorwa, Reuben Binns, and Christian Katzenbach. Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society, 7(1), 2020. doi:10.1177/2053951719897945
-
[8]
On calibration of modern neural networks
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017
-
[9]
The Concept of Law
Herbert Lionel Adolphus Hart. The Concept of Law. Oxford University Press, 1961
-
[10]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022
-
[11]
Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In Proceedings of ICLR 2023, 2023
-
[12]
Simple and scalable predictive uncertainty estimation using deep ensembles
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017
-
[13]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023
-
[14]
Examining reasoning LLMs-as-judges in non-verifiable LLM post-training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, and Zhengxing Chen. Examining reasoning LLMs-as-judges in non-verifiable LLM post-training. arXiv preprint arXiv:2603.12246, 2026
-
[15]
Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations
Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10:92–110, 2022. doi:10.1162/tacl_a_00449
-
[16]
Obtaining well calibrated probabilities using Bayesian binning
Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015
-
[17]
Inherent disagreements in human textual inferences
Ellie Pavlick and Tom Kwiatkowski. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694, 2019. doi:10.1162/tacl_a_00293
-
[18]
The “problem” of human label variation: On ground truth in data, modeling and evaluation
Barbara Plank. The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, 2022
-
[19]
The risk of racial bias in hate speech detection
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, 2019
-
[20]
Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems, volume 36, 2023
-
[21]
Learning from disagreement: A survey
Alexandra Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, and Barbara Plank. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470, 2021
-
[22]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023
discussion (0)