Let's Measure Information Step-by-Step: AI-Based Evaluation Beyond Vibes
Pith reviewed 2026-05-18 23:41 UTC · model grok-4.3
The pith
Treating AI evaluation as mutual information estimation via prompting lets total variation distance resist adversarial attacks without ground truth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the overseer is modeled as estimating mutual information through prompts, truthful agent reporting becomes an optimal strategy, and certain f-divergences such as total variation distance retain polynomial guarantees under attack. In the reported experiments, TVD-MI keeps an AUC of 0.70-0.77 under adversarial attacks while other approaches can decay toward chance.
What carries the argument
Mutual evaluation, in which the overseer estimates mutual information by prompting to make truthful reporting optimal for agents, paired with total variation distance to achieve robustness.
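For readers outside information theory, the standard textbook definitions behind these terms can be written out as follows. The notation ($I_f$, $P_{XY}$, $P_X \otimes P_Y$) is assumed here for illustration and may differ from the paper's own symbols.

```latex
% f-divergence form of mutual information between responses X and Y
I_f(X;Y) \;=\; D_f\!\left(P_{XY} \,\middle\|\, P_X \otimes P_Y\right)
        \;=\; \sum_{x,y} P_X(x)\,P_Y(y)\; f\!\left(\frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)}\right)

% Choosing f(t) = \tfrac{1}{2}\lvert t - 1 \rvert gives the total variation version
I_{\mathrm{TVD}}(X;Y) \;=\; \frac{1}{2} \sum_{x,y} \bigl\lvert P_{XY}(x,y) - P_X(x)\,P_Y(y) \bigr\rvert,
\qquad 0 \le I_{\mathrm{TVD}}(X;Y) \le 1

% Choosing f(t) = t \log t recovers Shannon mutual information, which has no fixed upper bound
I_{\mathrm{KL}}(X;Y) \;=\; \sum_{x,y} P_{XY}(x,y)\, \log \frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)}
```

The bounded range of the total variation form is the property the review's later discussion of bounded versus unbounded f refers to.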
If this is right
- Pairwise evaluations can be broken into reliable item-level detection scores without any ground truth.
- Prompting for information relationships rather than quality judgments increases resistance to manipulation.
- Certain f-divergences maintain polynomial guarantees for mutual information estimation in adversarial settings.
- The method overcomes a central limitation of standard peer prediction by producing usable per-item scores.
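To make the quantity behind these points concrete, here is a minimal sketch of a plug-in TVD-MI estimate computed from paired categorical responses. It is not the paper's prompting-based estimator, which queries a model; the function name `tvd_mi` and the toy data are hypothetical and only illustrate the definition.

```python
from collections import Counter


def tvd_mi(pairs):
    """Plug-in estimate of TVD mutual information from (x, y) samples.

    Computes 0.5 * sum_{x,y} |p(x, y) - p(x) * p(y)| using empirical frequencies.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one (x, y) pair")

    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)

    total = 0.0
    # Sum over the full product alphabet: cells with zero joint mass
    # still contribute p(x) * p(y) to the distance.
    for x, cx in px.items():
        for y, cy in py.items():
            p_joint = joint.get((x, y), 0) / n
            p_prod = (cx / n) * (cy / n)
            total += abs(p_joint - p_prod)
    return 0.5 * total


# Toy data (hypothetical): two agents that always agree vs. two unrelated agents.
agreeing = [(0, 0), (1, 1)] * 50
unrelated = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

print(tvd_mi(agreeing))   # 0.5
print(tvd_mi(unrelated))  # 0.0
```

A plug-in estimate like this needs enough samples per cell to be stable, so it is best read as a definition check rather than a substitute for the estimator the paper evaluates.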
Where Pith is reading between the lines
- The same prompting-plus-divergence structure might apply to other domains where verification is costly.
- It could be extended by testing on larger models to check whether the robustness scales.
- Links to mechanism design may produce new protocols that combine information estimation with incentives.
Load-bearing premise
That treating the overseer as a strategic player estimating mutual information by prompting makes truthful agent reporting an optimal strategy.
What would settle it
Run an experiment that gives agents explicit incentives to misreport information, then check whether the TVD-MI AUC falls below 0.70 or approaches 0.5 (chance).
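The check described here could be scripted roughly as follows. This is a sketch under stated assumptions: `auc` is the usual rank-based (Mann-Whitney) estimate, while `verdict`, the 0.05 band around chance, and the example scores are hypothetical choices that do not come from the paper.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def verdict(value, floor=0.70, chance_band=0.05):
    """Classify an AUC against the reported floor and a band around chance.

    `chance_band` is an assumed tolerance, not a threshold from the paper.
    """
    if abs(value - 0.5) <= chance_band:
        return "near chance"
    if value < floor:
        return "below reported range"
    return "at or above reported floor"


# Hypothetical detection scores for items that should and should not be flagged.
flagged = [0.9, 0.4]
clean = [0.5, 0.1]
value = auc(flagged, clean)
print(value, verdict(value))  # 0.75 at or above reported floor
```

The pairwise loop is quadratic in the number of items, which is fine for a sanity check; a sorted-rank implementation would be the natural replacement at scale.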
Original abstract
We evaluate artificial intelligence (AI) systems without ground truth by exploiting a link between strategic gaming and information loss. Building on established information theory, we analyze which mechanisms resist adversarial manipulation. This motivates mutual evaluation, where the overseer is treated as a strategic player estimating mutual information by prompting, making truthful agent reporting an optimal strategy. We show that certain f-divergences, such as total variation distance (TVD), maintain polynomial guarantees under attack, building on an established exponential barrier for estimating mutual information (MI) in worst-case certification settings. Under adversarial attacks, TVD-MI maintains effectiveness (area under the curve 0.70--0.77) while other approaches can decay toward chance, demonstrating that prompting the same system for information relationships rather than quality judgments can improve robustness. The mechanisms decompose pairwise evaluations into reliable item-level detection scores without ground truth, addressing a key limitation of standard peer prediction. Pre-registration: https://osf.io/c7pum
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AI systems can be evaluated without ground truth by treating the overseer as a strategic player who estimates mutual information via prompting, which makes truthful agent reporting an optimal strategy. It analyzes f-divergences such as total variation distance (TVD) and argues that these maintain polynomial guarantees under adversarial attack, building on an exponential barrier for worst-case MI estimation. Experiments are reported to show TVD-MI retaining AUC 0.70--0.77 under attacks while other methods decay toward chance; the approach is said to decompose pairwise evaluations into reliable item-level scores, addressing limitations of standard peer prediction.
Significance. If the central robustness and optimality claims hold, the work would offer a concrete information-theoretic route to ground-truth-free AI evaluation that is explicitly designed to resist strategic manipulation. The pre-registration and the explicit connection to established exponential barriers for MI estimation are positive features that strengthen the contribution.
major comments (3)
- [Abstract] Abstract and introduction: the claim that 'prompting the same system for information relationships rather than quality judgments' makes truthful reporting an optimal strategy is load-bearing for both the polynomial guarantees and the no-ground-truth robustness argument, yet no equilibrium analysis, agent utility function, or information structure of the prompting game is supplied, nor is it shown that no profitable deviation exists when the agent can choose arbitrary response distributions.
- [Abstract] The reduction from general mutual information to TVD-MI in the interactive prompting setting is asserted to preserve the optimality property and the polynomial attack guarantees, but the manuscript provides no derivation or proof that this reduction holds in the strategic overseer-agent interaction described.
- [Abstract] The reported AUC range 0.70--0.77 under adversarial attacks is central to the empirical robustness claim, but the abstract (and, by extension, the experimental section) supplies no details on attack model, baselines, statistical controls, or whether post-hoc choices affect the result, making it impossible to assess whether the numbers support the polynomial-guarantee interpretation.
minor comments (2)
- [Abstract] The pre-registration link is welcome; the manuscript should state explicitly which analyses were pre-registered versus exploratory.
- Notation for the f-divergences and the prompting-based MI estimator should be introduced with a short table or diagram to improve readability for readers outside information theory.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments help clarify how to better present the game-theoretic foundations and experimental specifics of our mutual evaluation approach using f-divergences. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
Referee: [Abstract] Abstract and introduction: the claim that 'prompting the same system for information relationships rather than quality judgments' makes truthful reporting an optimal strategy is load-bearing for both the polynomial guarantees and the no-ground-truth robustness argument, yet no equilibrium analysis, agent utility function, or information structure of the prompting game is supplied, nor is it shown that no profitable deviation exists when the agent can choose arbitrary response distributions.
Authors: We appreciate the referee highlighting the centrality of this claim. The manuscript motivates the approach by noting that querying for information relationships (rather than direct quality scores) aligns incentives because the scoring rule is based on an f-divergence such as TVD between the reported and observed distributions; any systematic deviation increases the measured divergence and lowers the agent's expected payoff. While the abstract and introduction are necessarily concise, the full text sketches the information structure and argues that truthful reporting is optimal under this mechanism. To strengthen the presentation, we will add an explicit subsection deriving the agent utility function, formalizing the prompting game, and providing a short proof that no profitable deviation exists for arbitrary response distributions.
Revision: yes
Referee: [Abstract] The reduction from general mutual information to TVD-MI in the interactive prompting setting is asserted to preserve the optimality property and the polynomial attack guarantees, but the manuscript provides no derivation or proof that this reduction holds in the strategic overseer-agent interaction described.
Authors: We thank the referee for this observation. The reduction rests on the property that TVD, as a particular f-divergence, converts the general MI estimation problem into one with polynomial sample complexity under adversarial perturbations, building on the known exponential barrier for worst-case MI. The interactive prompting is designed so that the overseer's queries elicit responses whose divergence directly corresponds to the MI term while preserving the optimality of truth-telling. We acknowledge that a self-contained derivation is not fully expanded in the current version. In the revision we will insert a formal step-by-step derivation (in the main text or appendix) showing how the reduction carries the optimality and polynomial guarantees into the strategic overseer-agent setting.
Revision: yes
Referee: [Abstract] The reported AUC range 0.70--0.77 under adversarial attacks is central to the empirical robustness claim, but the abstract (and, by extension, the experimental section) supplies no details on attack model, baselines, statistical controls, or whether post-hoc choices affect the result, making it impossible to assess whether the numbers support the polynomial-guarantee interpretation.
Authors: We agree that greater transparency on the experimental protocol is warranted. The experiments compare TVD-MI against standard peer-prediction and direct-quality baselines under response perturbations that simulate strategic misreporting; the reported AUC range reflects performance averaged across multiple attack strengths and random seeds. To address the concern directly, the revised manuscript will expand the experimental section with: (i) a precise description of the attack model (including how perturbations are generated to maximize divergence), (ii) the complete list of baselines, (iii) statistical controls such as confidence intervals and significance tests, and (iv) an explicit statement on any post-hoc choices and their potential impact. This will allow readers to evaluate how the empirical results relate to the theoretical polynomial guarantees.
Revision: yes
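As an illustration of the kind of statistical control mentioned in the last response, a percentile bootstrap over per-seed AUC values might look like the sketch below. The function `bootstrap_mean_ci`, its defaults, and the sample numbers are hypothetical placeholders; none of them are taken from the paper or its pre-registration.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Placeholder per-seed AUC values, NOT results from the paper.
per_seed_auc = [0.66, 0.71, 0.69, 0.74, 0.70]
low, high = bootstrap_mean_ci(per_seed_auc)
print(f"mean={statistics.fmean(per_seed_auc):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of seeds the interval is coarse, so reporting the number of seeds alongside it would help readers judge how much weight the range can bear.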
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper motivates mutual evaluation from the link between strategic gaming and information loss, treating the overseer as a strategic player whose prompting-based MI estimate (via f-divergences such as TVD) makes truthful reporting optimal. It builds explicitly on established information theory and an external exponential barrier result for worst-case MI certification, then reports empirical AUC retention (0.70-0.77) under attacks as external validation. No equation or claim reduces by construction to a fitted parameter, self-defined optimality, or load-bearing self-citation; the central robustness statement is presented as a consequence of the information-theoretic framing rather than an input renamed as output. The derivation remains self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Truthful agent reporting is an optimal strategy when the overseer estimates mutual information by prompting
- domain assumption: Certain f-divergences maintain polynomial guarantees under attack for mutual information estimation
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Theorem 3.3 ... bounded, piecewise-linear f admit ceilings that grow polynomially with N, whereas unbounded f have ceilings that scale only logarithmically
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
GR implies dominant-strategy incentive compatibility (DSIC) ... from DPI
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.