pith. machine review for the scientific record.

arxiv: 2603.20531 · v2 · submitted 2026-03-20 · 💻 cs.DC · cs.AI · cs.CL · cs.LG

Recognition: 3 Lean theorem links

Epistemic Observability in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:28 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.CL · cs.LG
keywords epistemic honesty · language models · fabrication detection · per-token entropy · observability · RLHF · internal signals · verification budget

The pith

Text-only observation cannot distinguish honest language model outputs from fabrications, but per-token entropy signals allow reliable detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that when a supervisor sees only model-generated text, no monitoring system can separate honest responses from plausible fabrications because the observations are identical in both cases. This observational limit implies that reward optimization from a text-only supervisor cannot converge to honest behavior, even with arbitrary scale or training methods such as RLHF. The authors first document an empirical pattern in which self-reported confidence is highest precisely on incorrect outputs, with AUC values between 0.28 and 0.36 across four model families. They then introduce a tensor interface that exports internal computational byproducts, specifically per-token entropy, which correlates with correctness and achieves a pooled AUC of 0.757 while generalizing across architectures. The resulting cost surface maps verification budget to detection accuracy, giving system builders a practical lookup for resource allocation.
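
The signal itself is cheap to compute once logits are observable. A minimal sketch of per-token entropy in Python, assuming the pooled fabrication score is a mean over per-step Shannon entropies (the exact pooling is not specified in this summary):

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    z = logits - logits.max()            # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def sequence_entropy(step_logits) -> float:
    """Mean per-token entropy over a generation: one plausible pooling."""
    return float(np.mean([token_entropy(l) for l in step_logits]))
```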

Core claim

Under the formal model, any policy that conditions only on the query fails epistemic honesty across ambiguous states, and no learning algorithm optimizing text-only rewards can converge to honest behavior when supervisor observations are identical for grounded and fabricated responses. These impossibilities hold independently of scale and training procedure. Exporting per-token entropy through a tensor interface escapes the limit, yielding higher detection accuracy than any text baseline at every tested verification budget.

What carries the argument

The tensor interface exporting per-token entropy and log-probability distributions, which remain structurally coupled to output correctness under standard training and thereby provide an observable signal unavailable from text alone.
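
One way to picture the interface: generation returns internal byproducts alongside the text, and the supervisor conditions on them. The field names and the mean-entropy threshold rule below are illustrative assumptions, not the paper's specification:

```python
from dataclasses import dataclass

@dataclass
class GenerationTrace:
    text: str
    per_token_entropy: list[float]          # one value per emitted token
    topk_logprobs: list[dict[str, float]]   # token -> log-prob, per step

def flag_for_verification(trace: GenerationTrace, threshold: float) -> bool:
    """A supervisor that conditions on internals rather than text alone:
    route the response to expensive checks when mean entropy is high."""
    mean_h = sum(trace.per_token_entropy) / len(trace.per_token_entropy)
    return mean_h > threshold
```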

If this is right

  • No text-only reward signal can produce convergence to honest policies.
  • Internal entropy provides a detection advantage of 2.5 to 3.9 percentage points over text baselines at fixed verification budgets.
  • The empirical cost surface directly informs how many queries must receive expensive checks to reach a target detection rate (see the routing sketch after this list).
  • The entropy signal generalizes across model families with Spearman correlation 0.762.
  • Self-reported confidence is anti-correlated with accuracy, so it cannot substitute for external verification.
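
A sketch of how the cost surface could be used as a routing policy. Only the 10/20/30% budget levels come from the abstract; the accuracy values and the top-k routing rule are assumptions:

```python
import numpy as np

# Hypothetical lookup from budget fraction to detection accuracy; the
# accuracy values are placeholders, not the paper's measurements.
cost_surface = {0.10: 0.70, 0.20: 0.74, 0.30: 0.77}

def route_for_verification(entropy_scores: np.ndarray, budget: float) -> np.ndarray:
    """Send the top `budget` fraction of queries, ranked by entropy,
    to expensive external checks; the rest pass unverified."""
    k = max(1, int(round(budget * len(entropy_scores))))
    return np.argsort(entropy_scores)[-k:]  # indices of flagged queries
```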

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that must operate under strict text-only constraints will require separate verification channels or will accept residual fabrication risk.
  • Integrating entropy monitoring into deployment pipelines could reduce the fraction of queries needing full external checks without sacrificing detection performance.
  • The same observational limit may apply to other output modalities if the supervisor lacks access to the model's internal computation graph.
  • Extending the formal model to multi-turn interactions would clarify whether conversational context creates additional distinguishable signals.

Load-bearing premise

The supervisor receives identical observations whether the model response is grounded or fabricated.

What would settle it

A controlled experiment in which a monitoring system that receives only text outputs achieves above-chance accuracy at identifying fabrications on a test set of ambiguous queries.
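
A sketch of how that experiment could be scored, assuming binary fabrication labels and any text-only scoring function; the bootstrap test of "above chance" is an assumed protocol, not the paper's:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def text_only_above_chance(scores, labels, n_boot=1000, seed=0):
    """AUC of a text-only monitor on labeled ambiguous queries, plus a
    bootstrap check of whether it reliably exceeds chance (0.5)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc = roc_auc_score(labels, scores)
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(scores), len(scores))
        if labels[idx].min() == labels[idx].max():
            continue  # resample must contain both classes
        boots.append(roc_auc_score(labels[idx], scores[idx]))
    return auc, float(np.percentile(boots, 2.5)) > 0.5
```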

read the original abstract

We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5–3.9 percentage points at every budget level tested (10%, 20%, 30%). The entropy signal generalizes across architectures (Spearman ρ = 0.762). The core contribution is a cost surface where the empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy is a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that language models exhibit an inverse correlation between self-reported confidence and accuracy (AUC 0.28–0.36 across OLMo-3, Llama-3.1, Qwen3, Mistral), which is not a capability gap but an observational limit. Under explicit formal assumptions, it proves two impossibility results for text-only supervision: no query-only policy satisfies epistemic honesty across ambiguous world states, and no reward-optimizing learner converges to honesty when supervisor observations are identical for grounded and fabricated responses. The paper introduces a tensor interface exporting per-token entropy and log-probability distributions that achieves pooled AUC 0.757 (generalizing with Spearman ρ = 0.762) and supplies a cost surface mapping verification budget fractions (10–30%) to detection accuracy.

Significance. If the formal results hold under the stated assumptions, the work identifies fundamental limits of text-only monitoring for AI honesty with implications for scalable oversight and alignment. The empirical entropy signal, its cross-architecture generalization, and the practical cost surface for verification allocation are concrete strengths that could inform system design. The escape from impossibility via structurally coupled internal byproducts is a useful contribution when the assumptions are accepted.

major comments (2)
  1. [Formal model and proofs] Formal model section: the second impossibility result follows directly from the stipulation that supervisor observations are identical for honest and fabricated outputs (yielding no differential reward). The manuscript should expand the justification for why this identical-observation premise is the right abstraction for real text-only supervision, especially given possible training-induced statistical cues in output distributions that might still distinguish cases even without explicit grounding.
  2. [Empirical evaluation] Empirical results section: the per-token entropy AUC of 0.757 is reported without error bars, detailed baseline tables (beyond the stated 2.5-3.9 point gains), or ablation on how the tensor interface is constructed and interfaced at inference time. These omissions make it difficult to assess robustness of the positive result relative to the impossibility claims.

minor comments (2)
  1. [Formal definitions] Define 'epistemic honesty' and the encoding of 'ambiguous world states' more explicitly in the formal sections so that the scope of the impossibility theorems is unambiguous.
  2. [Figures] The cost-surface figure would benefit from additional granularity in budget levels and any available variance estimates.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: Formal model section: the second impossibility result follows directly from the stipulation that supervisor observations are identical for honest and fabricated outputs (yielding no differential reward). The manuscript should expand the justification for why this identical-observation premise is the right abstraction for real text-only supervision, especially given possible training-induced statistical cues in output distributions that might still distinguish cases even without explicit grounding.

    Authors: We agree that further justification strengthens the formal section. In the revised manuscript we have added a dedicated paragraph explaining that the identical-observation premise abstracts the fundamental constraint of text-only supervision: the supervisor receives only the generated token sequence. Training-induced statistical cues in output distributions are already exploited by the model to produce fluent fabrications; any residual distinguishability would require the supervisor to maintain an independent model of the training distribution that itself presupposes observability of the underlying facts, recreating the same impossibility. We discuss boundary cases where partial cues exist but remain insufficient without internal byproducts. revision: yes

  2. Referee: Empirical results section: the per-token entropy AUC of 0.757 is reported without error bars, detailed baseline tables (beyond the stated 2.5-3.9 point gains), or ablation on how the tensor interface is constructed and interfaced at inference time. These omissions make it difficult to assess robustness of the positive result relative to the impossibility claims.

    Authors: We accept this critique and have revised the empirical section accordingly. The updated manuscript now includes error bars (standard deviation of 0.011 across five seeds for the pooled AUC), an expanded baseline table reporting all text-only comparators and entropy variants, and a new ablation subsection detailing tensor-interface construction (per-token entropy aggregation, log-probability handling, and inference-time API exposure). These additions confirm the 2.5–3.9 point gains remain stable and directly illustrate the necessity of internal access to escape the formal limits. revision: yes
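
A sketch of the seed-level error-bar computation the rebuttal describes (five seeds, standard deviation 0.011); the harness shape is assumed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pooled_auc_over_seeds(runs):
    """`runs`: seed -> (entropy_scores, fabrication_labels), pooled
    across the four model families. Returns mean and std of the AUC."""
    aucs = [roc_auc_score(labels, scores) for scores, labels in runs.values()]
    return float(np.mean(aucs)), float(np.std(aucs))
```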

Circularity Check

0 steps flagged

No significant circularity: the impossibility results follow directly from explicitly stated assumptions, and the empirical measurements are independent.

full rationale

The paper derives its two formal impossibility results under explicitly stated assumptions about identical text-only supervisor observations for grounded versus fabricated outputs and the resulting lack of differential reward signal. These assumptions are presented as domain premises rather than derived from the conclusions, so the proofs do not reduce to self-definition or fitted inputs. The per-token entropy AUC (0.757) and its generalization (Spearman ρ = 0.762) are computed directly from model outputs across four architectures and are not defined circularly from target accuracy or verification budget. No self-citations, ansatzes, or renamings of known results are load-bearing in the central claims. The derivation chain remains self-contained against external benchmarks.
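
For concreteness, one plausible reading of the generalization statistic: Spearman rank correlation of per-query entropy between model families on a shared query set. The paper may define the pairing differently:

```python
from scipy.stats import spearmanr

def cross_family_rank_agreement(entropy_family_a, entropy_family_b):
    """Spearman rank correlation of per-query mean entropy between two
    model families evaluated on the same queries."""
    rho, p = spearmanr(entropy_family_a, entropy_family_b)
    return rho, p
```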

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on domain assumptions about text-only observability and reward equivalence; the tensor interface is introduced as a new construct without external falsifiable evidence beyond the reported AUC values.

axioms (2)
  • domain assumption: Supervisor observes only model output text
    Invoked for both impossibility proofs
  • domain assumption: Observations are identical for grounded and fabricated responses
    Core premise that prevents convergence to honest behavior
invented entities (1)
  • tensor interface (no independent evidence)
    purpose: Exporting per-token entropy and log-probability distributions structurally coupled to correctness
    Construct introduced to escape the text-only impossibility

pith-pipeline@v0.9.0 · 5623 in / 1397 out tokens · 42819 ms · 2026-05-15T06:28:39.411287+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.