Let's Measure Information Step-by-Step: AI-Based Evaluation Beyond Vibes

Sanmi Koyejo; Zachary Robertson

arxiv: 2508.05469 · v3 · submitted 2025-08-07 · cs.LG · cs.IT · math.IT

Pith reviewed 2026-05-18 23:41 UTC · model grok-4.3

keywords: AI evaluation, mutual information, adversarial robustness, f-divergences, total variation distance, peer prediction, scalable oversight

The pith

Treating AI evaluation as mutual information estimation via prompting lets an estimator based on total variation distance (TVD-MI) resist adversarial attacks without ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to evaluate AI systems by having an overseer prompt for information relationships instead of quality scores. This setup turns honest reporting into the best strategy for the agents being judged. Using f-divergences such as total variation distance preserves performance with AUC values from 0.70 to 0.77 even under attack, while other methods drop toward random guessing. The approach breaks pairwise judgments into item-level detection scores that hold up without needing correct answers.

Core claim

By modeling the overseer as estimating mutual information through prompts, truthful agent reporting becomes optimal, and f-divergences like total variation distance retain polynomial guarantees under attack, so TVD-MI keeps AUC 0.70-0.77 while alternatives decay to chance.
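
For readers outside information theory, the standard textbook definitions behind the term TVD-MI can be written out as follows. This is a sketch of the general f-divergence framework, not a reproduction of the paper's own notation: the symbols P_{XY}, P_X, P_Y and the generator f are assumptions chosen here for illustration.

```latex
% f-divergence for a convex generator f with f(1) = 0
D_f(P \,\|\, Q) = \mathbb{E}_{Q}\!\left[ f\!\left( \frac{dP}{dQ} \right) \right]

% f-mutual information: divergence between the joint and the product of marginals
I_f(X;Y) = D_f\!\left( P_{XY} \,\|\, P_X \otimes P_Y \right)

% Total variation distance corresponds to f(t) = \tfrac{1}{2}\,|t - 1|, so for discrete X, Y
I_{\mathrm{TVD}}(X;Y) = \frac{1}{2} \sum_{x,y} \bigl| p(x,y) - p(x)\,p(y) \bigr| \in [0, 1]

% Shannon mutual information is the case f(t) = t \log t (the KL divergence)
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
```

The TVD generator is bounded while the KL generator is not, which is the distinction the core claim leans on when it contrasts the two.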

What carries the argument

Mutual evaluation, in which the overseer estimates mutual information by prompting to make truthful reporting optimal for agents, paired with total variation distance to achieve robustness.
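
One way to make "estimating mutual information by prompting" concrete is the variational form of total variation, TV(P, Q) = sup_A [P(A) - Q(A)]: any binary critic defines an event A, so its hit rate on truly paired responses minus its hit rate on shuffled pairs estimates a lower bound on TVD-MI. The sketch below assumes the critic is an LLM prompted to say whether two responses concern the same item; the function name, inputs, and example decisions are hypothetical and not taken from the paper.

```python
from typing import Sequence


def tvd_mi_lower_bound(paired: Sequence[int], shuffled: Sequence[int]) -> float:
    """Estimate a lower bound on TVD-MI from binary critic decisions.

    paired[k]   is 1 if the critic judged a same-item response pair as related.
    shuffled[k] is 1 if the critic judged a cross-item (shuffled) pair as related.
    The return value is the critic's true positive rate minus its false positive rate.
    """
    if len(paired) == 0 or len(shuffled) == 0:
        raise ValueError("need at least one paired and one shuffled decision")
    tpr = sum(paired) / len(paired)      # critic fires on draws from the joint
    fpr = sum(shuffled) / len(shuffled)  # critic fires on draws from the product of marginals
    return tpr - fpr


# Hypothetical critic outputs, for illustration only.
paired_decisions = [1, 1, 1, 0, 1, 1, 0, 1]
shuffled_decisions = [0, 0, 1, 0, 0, 0, 0, 1]
estimate = tvd_mi_lower_bound(paired_decisions, shuffled_decisions)
print(f"TVD-MI lower-bound estimate: {estimate:.3f}")
```

A weak or noisy critic can only shrink this difference in expectation, so the estimate errs on the conservative side rather than overstating shared information.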

If this is right

Pairwise evaluations can be broken into reliable item-level detection scores without any ground truth.
Prompting for information relationships rather than quality judgments increases resistance to manipulation.
Certain f-divergences maintain polynomial guarantees for mutual information estimation in adversarial settings.
The method overcomes a central limitation of standard peer prediction by producing usable per-item scores.
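
To illustrate how an aggregate score can split into per-item terms, the sketch below uses a simple paired-minus-shuffled estimator, which is an assumption made here and not necessarily the decomposition the paper uses. Each item contributes its own paired critic decision minus the average decision on mismatched pairs that include it. The names item_scores, paired, and shuffled_by_item, and all the data, are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def item_scores(paired: Sequence[int], shuffled_by_item: Sequence[Sequence[int]]) -> List[float]:
    """Per-item terms of a paired-minus-shuffled estimator.

    paired[k]           critic decision (0 or 1) for item k's matched response pair.
    shuffled_by_item[k] critic decisions for mismatched pairs that include item k.
    """
    if len(paired) != len(shuffled_by_item):
        raise ValueError("one list of shuffled decisions is required per item")
    return [float(p) - mean(s) for p, s in zip(paired, shuffled_by_item)]


# Hypothetical critic decisions for four items.
paired = [1, 1, 0, 1]
shuffled_by_item = [[0, 0], [1, 0], [0, 0], [0, 1]]

scores = item_scores(paired, shuffled_by_item)
overall = mean(scores)  # with equal-sized shuffled lists this equals TPR minus FPR
print(scores, overall)
```

No correct answers enter the computation: each score depends only on how the critic treats matched versus mismatched pairs.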

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting-plus-divergence structure might apply to other domains where verification is costly.
It could be extended by testing on larger models to check whether the robustness scales.
Links to mechanism design may produce new protocols that combine information estimation with incentives.

Load-bearing premise

That treating the overseer as a strategic player estimating mutual information by prompting makes truthful agent reporting an optimal strategy.

What would settle it

Run an experiment giving agents explicit incentives to misreport information and check whether TVD-MI AUC drops below 0.70 or reaches near 0.5.
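
The check described above could be scored as in the sketch below, which computes AUC as the probability that an honest report outscores a manipulated one (the Mann-Whitney statistic, with ties counted as one half). It assumes a controlled experiment in which the experimenter knows which agents were incentivized to misreport; those labels are used only to grade the detector, never to compute the scores. The scores, labels, and the use of 0.70 as a threshold are illustrative assumptions.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC for separating honest (label 1) from manipulated (label 0) reports."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one honest and one manipulated report")
    # Count honest-vs-manipulated pairs where the honest score wins; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detection scores from a run with incentivized misreporting.
detection_scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.5]
is_honest = [1, 1, 1, 0, 0, 0]

value = auc(detection_scores, is_honest)
below_reported_range = value < 0.70  # lower end of the range quoted in the abstract
print(f"AUC = {value:.3f}, below reported range: {below_reported_range}")
```

An AUC close to 0.5 in such a run would mean the scores no longer distinguish honest from manipulated reports.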

Original abstract

We evaluate artificial intelligence (AI) systems without ground truth by exploiting a link between strategic gaming and information loss. Building on established information theory, we analyze which mechanisms resist adversarial manipulation. This motivates mutual evaluation, where the overseer is treated as a strategic player estimating mutual information by prompting, making truthful agent reporting an optimal strategy. We show that certain f-divergences, such as total variation distance (TVD), maintain polynomial guarantees under attack, building on an established exponential barrier for estimating mutual information (MI) in worst-case certification settings. Under adversarial attacks, TVD-MI maintains effectiveness (area under the curve 0.70--0.77) while other approaches can decay toward chance, demonstrating that prompting the same system for information relationships rather than quality judgments can improve robustness. The mechanisms decompose pairwise evaluations into reliable item-level detection scores without ground truth, addressing a key limitation of standard peer prediction. Pre-registration: https://osf.io/c7pum
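
The "link between strategic gaming and information loss" mentioned in the abstract can be illustrated with the data processing inequality for f-divergences: if an agent post-processes an honest report without access to extra information, the f-mutual information with a peer's report cannot increase. The sketch below checks this numerically for TVD-MI on a toy joint distribution. The distributions, the channel, and the function names are assumptions for illustration; modeling manipulation as a fixed noisy channel is a simplification, not the paper's attack model.

```python
from typing import List

Matrix = List[List[float]]


def tvd_mi(joint: Matrix) -> float:
    """TVD mutual information: half the L1 distance between joint and product of marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return 0.5 * sum(
        abs(joint[i][j] - px[i] * py[j])
        for i in range(len(px))
        for j in range(len(py))
    )


def garble(joint: Matrix, channel: Matrix) -> Matrix:
    """Pass the second variable through a channel, where channel[y][z] = P(report z | honest y)."""
    n_out = len(channel[0])
    return [
        [sum(row[y] * channel[y][z] for y in range(len(row))) for z in range(n_out)]
        for row in joint
    ]


# Toy joint distribution of a peer's report X and an honest report Y.
honest = [[0.4, 0.1], [0.1, 0.4]]
# Hypothetical manipulation: the agent flips its honest report 30% of the time.
noisy_channel = [[0.7, 0.3], [0.3, 0.7]]
manipulated = garble(honest, noisy_channel)

print(f"honest TVD-MI:      {tvd_mi(honest):.2f}")
print(f"manipulated TVD-MI: {tvd_mi(manipulated):.2f}")
```

In this toy case the manipulated report carries strictly less measurable information about the peer's report, which is the direction the incentive argument needs.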

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows TVD-MI via prompting holds AUC 0.70-0.77 under attacks in ground-truth-free AI eval while others drop to chance, but the claim that this makes truthful reporting optimal rests on thin game theory.

The main thing to know is that this work applies total variation distance to mutual information estimation through prompting, then reports that the resulting scores stay reasonably effective under adversarial attacks. The abstract gives AUC numbers in the 0.70-0.77 range for TVD-MI while other methods fall toward chance. That empirical contrast is the clearest deliverable so far. The authors also decompose the pairwise checks into item-level detection scores without needing labels, which directly tackles a practical headache in peer prediction setups. The link to the known exponential barrier for worst-case MI estimation is a reasonable way to motivate why TVD might give polynomial-style guarantees instead. Those pieces are straightforward extensions of existing information theory and peer prediction results, and the pre-registration link is a small plus for transparency.

The soft spot is the strategic argument that underpins the whole no-ground-truth claim. Treating the overseer as a player whose MI estimate induces truthful reporting as the agent's best move sounds clean in the abstract, but the paper does not lay out the agent's utility function, the exact prompting game, or a proof that no profitable deviation exists when the agent can pick arbitrary response distributions. Without that equilibrium step, the robustness story stays more motivational than rigorous. The experimental details are also light in the abstract, so it is hard to judge whether the attack setup, baselines, or statistical controls are solid enough to support the polynomial guarantee language.

This is aimed at researchers who build or audit AI systems and want evaluation methods that do not depend on external labels. Anyone already working with f-divergences or peer prediction will see the connection quickly. It is worth sending to a serious referee because the problem is concrete and the approach sits on real prior results, even though the game-theoretic foundation and experimental transparency will need work.

Referee Report

3 major / 2 minor

Summary. The paper claims that AI systems can be evaluated without ground truth by treating the overseer as a strategic player who estimates mutual information via prompting, which makes truthful agent reporting an optimal strategy. It analyzes f-divergences such as total variation distance (TVD) and argues that these maintain polynomial guarantees under adversarial attack, building on an exponential barrier for worst-case MI estimation. Experiments are reported to show TVD-MI retaining AUC 0.70--0.77 under attacks while other methods decay toward chance; the approach is said to decompose pairwise evaluations into reliable item-level scores, addressing limitations of standard peer prediction.

Significance. If the central robustness and optimality claims hold, the work would offer a concrete information-theoretic route to ground-truth-free AI evaluation that is explicitly designed to resist strategic manipulation. The pre-registration and the explicit connection to established exponential barriers for MI estimation are positive features that strengthen the contribution.

major comments (3)

[Abstract] Abstract and introduction: the claim that 'prompting the same system for information relationships rather than quality judgments' makes truthful reporting an optimal strategy is load-bearing for both the polynomial guarantees and the no-ground-truth robustness argument, yet no equilibrium analysis, agent utility function, or information structure of the prompting game is supplied, nor is it shown that no profitable deviation exists when the agent can choose arbitrary response distributions.
[Abstract] The reduction from general mutual information to TVD-MI in the interactive prompting setting is asserted to preserve the optimality property and the polynomial attack guarantees, but the manuscript provides no derivation or proof that this reduction holds in the strategic overseer-agent interaction described.
[Abstract] The reported AUC range 0.70--0.77 under adversarial attacks is central to the empirical robustness claim, but the abstract (and, by extension, the experimental section) supplies no details on attack model, baselines, statistical controls, or whether post-hoc choices affect the result, making it impossible to assess whether the numbers support the polynomial-guarantee interpretation.

minor comments (2)

[Abstract] The pre-registration link is welcome; the manuscript should state explicitly which analyses were pre-registered versus exploratory.
Notation for the f-divergences and the prompting-based MI estimator should be introduced with a short table or diagram to improve readability for readers outside information theory.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments help clarify how to better present the game-theoretic foundations and experimental specifics of our mutual evaluation approach using f-divergences. We respond to each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract and introduction: the claim that 'prompting the same system for information relationships rather than quality judgments' makes truthful reporting an optimal strategy is load-bearing for both the polynomial guarantees and the no-ground-truth robustness argument, yet no equilibrium analysis, agent utility function, or information structure of the prompting game is supplied, nor is it shown that no profitable deviation exists when the agent can choose arbitrary response distributions.

Authors: We appreciate the referee highlighting the centrality of this claim. The manuscript motivates the approach by noting that querying for information relationships (rather than direct quality scores) aligns incentives because the scoring rule is based on an f-divergence such as TVD between the reported and observed distributions; any systematic deviation increases the measured divergence and lowers the agent's expected payoff. While the abstract and introduction are necessarily concise, the full text sketches the information structure and argues that truthful reporting is optimal under this mechanism. To strengthen the presentation, we will add an explicit subsection deriving the agent utility function, formalizing the prompting game, and providing a short proof that no profitable deviation exists for arbitrary response distributions.
Revision: yes

Referee: [Abstract] The reduction from general mutual information to TVD-MI in the interactive prompting setting is asserted to preserve the optimality property and the polynomial attack guarantees, but the manuscript provides no derivation or proof that this reduction holds in the strategic overseer-agent interaction described.

Authors: We thank the referee for this observation. The reduction rests on the property that TVD, as a particular f-divergence, converts the general MI estimation problem into one with polynomial sample complexity under adversarial perturbations, building on the known exponential barrier for worst-case MI. The interactive prompting is designed so that the overseer's queries elicit responses whose divergence directly corresponds to the MI term while preserving the optimality of truth-telling. We acknowledge that a self-contained derivation is not fully expanded in the current version. In the revision we will insert a formal step-by-step derivation (in the main text or appendix) showing how the reduction carries the optimality and polynomial guarantees into the strategic overseer-agent setting.
Revision: yes

Referee: [Abstract] The reported AUC range 0.70--0.77 under adversarial attacks is central to the empirical robustness claim, but the abstract (and, by extension, the experimental section) supplies no details on attack model, baselines, statistical controls, or whether post-hoc choices affect the result, making it impossible to assess whether the numbers support the polynomial-guarantee interpretation.

Authors: We agree that greater transparency on the experimental protocol is warranted. The experiments compare TVD-MI against standard peer-prediction and direct-quality baselines under response perturbations that simulate strategic misreporting; the reported AUC range reflects performance averaged across multiple attack strengths and random seeds. To address the concern directly, the revised manuscript will expand the experimental section with: (i) a precise description of the attack model (including how perturbations are generated to maximize divergence), (ii) the complete list of baselines, (iii) statistical controls such as confidence intervals and significance tests, and (iv) an explicit statement on any post-hoc choices and their potential impact. This will allow readers to evaluate how the empirical results relate to the theoretical polynomial guarantees.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper motivates mutual evaluation from the link between strategic gaming and information loss, treating the overseer as a strategic player whose prompting-based MI estimate (via f-divergences such as TVD) makes truthful reporting optimal. It builds explicitly on established information theory and an external exponential barrier result for worst-case MI certification, then reports empirical AUC retention (0.70-0.77) under attacks as external validation. No equation or claim reduces by construction to a fitted parameter, self-defined optimality, or load-bearing self-citation; the central robustness statement is presented as a consequence of the information-theoretic framing rather than an input renamed as output. The derivation remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the link between strategic gaming and information loss plus the optimality of truthful reporting under mutual information prompting; these are domain assumptions drawn from information theory rather than new derivations.

axioms (2)

domain assumption Truthful agent reporting is an optimal strategy when the overseer estimates mutual information by prompting
Invoked to motivate the mutual evaluation framework and the resistance to adversarial manipulation.
domain assumption Certain f-divergences maintain polynomial guarantees under attack for mutual information estimation
Builds on an established exponential barrier result in worst-case certification settings.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
Paper passage: Theorem 3.3 ... bounded, piecewise-linear f admit ceilings that grow polynomially with N, whereas unbounded f have ceilings that scale only logarithmically

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes
Paper passage: GR implies dominant-strategy incentive compatibility (DSIC) ... from DPI

echoes: The paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.