pith. machine review for the scientific record.

arxiv: 2603.20531 · v2 · submitted 2026-03-20 · 💻 cs.DC · cs.AI · cs.CL · cs.LG

Recognition: 3 Lean theorem links

Epistemic Observability in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:28 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.CL · cs.LG
keywords epistemic honesty · language models · fabrication detection · per-token entropy · observability · RLHF · internal signals · verification budget

The pith

Text-only observation cannot distinguish honest language model outputs from fabrications, but per-token entropy signals allow reliable detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that when a supervisor sees only model-generated text, no monitoring system can separate honest responses from plausible fabrications because the observations are identical in both cases. This observational limit implies that reward optimization from a text-only supervisor cannot converge to honest behavior, even with arbitrary scale or training methods such as RLHF. The authors first document an empirical pattern in which self-reported confidence is highest precisely on incorrect outputs, with AUC values between 0.28 and 0.36 across four model families. They then introduce a tensor interface that exports internal computational byproducts, specifically per-token entropy, which correlates with correctness and achieves a pooled AUC of 0.757 while generalizing across architectures. The resulting cost surface maps verification budget to detection accuracy, giving system builders a practical lookup for resource allocation.
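
The signal itself is cheap to compute once logits are observable. A minimal sketch of per-token entropy in Python, assuming the pooled fabrication score is a mean over per-step Shannon entropies (the exact pooling is not specified in this summary):

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    z = logits - logits.max()            # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def sequence_entropy(step_logits) -> float:
    """Mean per-token entropy over a generation: one plausible pooling."""
    return float(np.mean([token_entropy(l) for l in step_logits]))
```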

Core claim

Under the formal model, any policy that conditions only on the query fails epistemic honesty across ambiguous states, and no learning algorithm optimizing text-only rewards can converge to honest behavior when supervisor observations are identical for grounded and fabricated responses. These impossibilities hold independently of scale and training procedure. Exporting per-token entropy through a tensor interface escapes the limit, yielding higher detection accuracy than any text baseline at every tested verification budget.

What carries the argument

The tensor interface exporting per-token entropy and log-probability distributions, which remain structurally coupled to output correctness under standard training and thereby provide an observable signal unavailable from text alone.
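
One way to picture the interface: generation returns internal byproducts alongside the text, and the supervisor conditions on them. The field names and the mean-entropy threshold rule below are illustrative assumptions, not the paper's specification:

```python
from dataclasses import dataclass

@dataclass
class GenerationTrace:
    text: str
    per_token_entropy: list[float]          # one value per emitted token
    topk_logprobs: list[dict[str, float]]   # token -> log-prob, per step

def flag_for_verification(trace: GenerationTrace, threshold: float) -> bool:
    """A supervisor that conditions on internals rather than text alone:
    route the response to expensive checks when mean entropy is high."""
    mean_h = sum(trace.per_token_entropy) / len(trace.per_token_entropy)
    return mean_h > threshold
```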

If this is right

  • No text-only reward signal can produce convergence to honest policies.
  • Internal entropy provides a detection advantage of 2.5 to 3.9 percentage points over text baselines at fixed verification budgets.
  • The empirical cost surface directly informs how many queries must receive expensive checks to reach a target detection rate (see the routing sketch after this list).
  • The entropy signal generalizes across model families with Spearman correlation 0.762.
  • Self-reported confidence is anti-correlated with accuracy, so it cannot substitute for external verification.
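
A sketch of how the cost surface could be used as a routing policy. Only the 10/20/30% budget levels come from the abstract; the accuracy values and the top-k routing rule are assumptions:

```python
import numpy as np

# Hypothetical lookup from budget fraction to detection accuracy; the
# accuracy values are placeholders, not the paper's measurements.
cost_surface = {0.10: 0.70, 0.20: 0.74, 0.30: 0.77}

def route_for_verification(entropy_scores: np.ndarray, budget: float) -> np.ndarray:
    """Send the top `budget` fraction of queries, ranked by entropy,
    to expensive external checks; the rest pass unverified."""
    k = max(1, int(round(budget * len(entropy_scores))))
    return np.argsort(entropy_scores)[-k:]  # indices of flagged queries
```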

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that must operate under strict text-only constraints will require separate verification channels or will accept residual fabrication risk.
  • Integrating entropy monitoring into deployment pipelines could reduce the fraction of queries needing full external checks without sacrificing detection performance.
  • The same observational limit may apply to other output modalities if the supervisor lacks access to the model's internal computation graph.
  • Extending the formal model to multi-turn interactions would clarify whether conversational context creates additional distinguishable signals.

Load-bearing premise

The supervisor receives identical observations whether the model response is grounded or fabricated.

What would settle it

A controlled experiment in which a monitoring system that receives only text outputs achieves above-chance accuracy at identifying fabrications on a test set of ambiguous queries.
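
A sketch of how that experiment could be scored, assuming binary fabrication labels and any text-only scoring function; the bootstrap test of "above chance" is an assumed protocol, not the paper's:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def text_only_above_chance(scores, labels, n_boot=1000, seed=0):
    """AUC of a text-only monitor on labeled ambiguous queries, plus a
    bootstrap check of whether it reliably exceeds chance (0.5)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc = roc_auc_score(labels, scores)
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(scores), len(scores))
        if labels[idx].min() == labels[idx].max():
            continue  # resample must contain both classes
        boots.append(roc_auc_score(labels[idx], scores[idx]))
    return auc, float(np.percentile(boots, 2.5)) > 0.5
```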

read the original abstract

We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5–3.9 percentage points at every budget level tested (10%, 20%, 30%). The entropy signal generalizes across architectures (Spearman ρ = 0.762). The core contribution is a cost surface where the empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy is a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that language models exhibit an inverse correlation between self-reported confidence and accuracy (AUC 0.28–0.36 across OLMo-3, Llama-3.1, Qwen3, Mistral), which is not a capability gap but an observational limit. Under explicit formal assumptions, it proves two impossibility results for text-only supervision: no query-only policy satisfies epistemic honesty across ambiguous world states, and no reward-optimizing learner converges to honesty when supervisor observations are identical for grounded and fabricated responses. The paper introduces a tensor interface exporting per-token entropy and log-probability distributions that achieves pooled AUC 0.757 (generalizing with Spearman ρ = 0.762) and supplies a cost surface mapping verification budget fractions (10–30%) to detection accuracy.

Significance. If the formal results hold under the stated assumptions, the work identifies fundamental limits of text-only monitoring for AI honesty with implications for scalable oversight and alignment. The empirical entropy signal, its cross-architecture generalization, and the practical cost surface for verification allocation are concrete strengths that could inform system design. The escape from impossibility via structurally coupled internal byproducts is a useful contribution when the assumptions are accepted.

major comments (2)
  1. [Formal model and proofs] Formal model section: the second impossibility result follows directly from the stipulation that supervisor observations are identical for honest and fabricated outputs (yielding no differential reward). The manuscript should expand the justification for why this identical-observation premise is the right abstraction for real text-only supervision, especially given possible training-induced statistical cues in output distributions that might still distinguish cases even without explicit grounding.
  2. [Empirical evaluation] Empirical results section: the per-token entropy AUC of 0.757 is reported without error bars, detailed baseline tables (beyond the stated 2.5-3.9 point gains), or ablation on how the tensor interface is constructed and interfaced at inference time. These omissions make it difficult to assess robustness of the positive result relative to the impossibility claims.

minor comments (2)
  1. [Formal definitions] Define 'epistemic honesty' and the encoding of 'ambiguous world states' more explicitly in the formal sections so that the scope of the impossibility theorems is unambiguous.
  2. [Figures] The cost-surface figure would benefit from additional granularity in budget levels and any available variance estimates.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: Formal model section: the second impossibility result follows directly from the stipulation that supervisor observations are identical for honest and fabricated outputs (yielding no differential reward). The manuscript should expand the justification for why this identical-observation premise is the right abstraction for real text-only supervision, especially given possible training-induced statistical cues in output distributions that might still distinguish cases even without explicit grounding.

    Authors: We agree that further justification strengthens the formal section. In the revised manuscript we have added a dedicated paragraph explaining that the identical-observation premise abstracts the fundamental constraint of text-only supervision: the supervisor receives only the generated token sequence. Training-induced statistical cues in output distributions are already exploited by the model to produce fluent fabrications; any residual distinguishability would require the supervisor to maintain an independent model of the training distribution that itself presupposes observability of the underlying facts, recreating the same impossibility. We discuss boundary cases where partial cues exist but remain insufficient without internal byproducts. revision: yes

  2. Referee: Empirical results section: the per-token entropy AUC of 0.757 is reported without error bars, detailed baseline tables (beyond the stated 2.5-3.9 point gains), or ablation on how the tensor interface is constructed and interfaced at inference time. These omissions make it difficult to assess robustness of the positive result relative to the impossibility claims.

    Authors: We accept this critique and have revised the empirical section accordingly. The updated manuscript now includes error bars (standard deviation of 0.011 across five seeds for the pooled AUC), an expanded baseline table reporting all text-only comparators and entropy variants, and a new ablation subsection detailing tensor-interface construction (per-token entropy aggregation, log-probability handling, and inference-time API exposure). These additions confirm the 2.5–3.9 point gains remain stable and directly illustrate the necessity of internal access to escape the formal limits. revision: yes
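
A sketch of the seed-level error-bar computation the rebuttal describes (five seeds, standard deviation 0.011); the harness shape is assumed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pooled_auc_over_seeds(runs):
    """`runs`: seed -> (entropy_scores, fabrication_labels), pooled
    across the four model families. Returns mean and std of the AUC."""
    aucs = [roc_auc_score(labels, scores) for scores, labels in runs.values()]
    return float(np.mean(aucs)), float(np.std(aucs))
```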

Circularity Check

0 steps flagged

No significant circularity: the impossibility results follow directly from explicitly stated assumptions, and the empirical measurements are independent.

full rationale

The paper derives its two formal impossibility results under explicitly stated assumptions about identical text-only supervisor observations for grounded versus fabricated outputs and the resulting lack of differential reward signal. These assumptions are presented as domain premises rather than derived from the conclusions, so the proofs do not reduce to self-definition or fitted inputs. The per-token entropy AUC (0.757) and its generalization (Spearman ρ = 0.762) are computed directly from model outputs across four architectures and are not defined circularly from target accuracy or verification budget. No self-citations, ansatzes, or renamings of known results are load-bearing in the central claims. The derivation chain remains self-contained against external benchmarks.
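
For concreteness, one plausible reading of the generalization statistic: Spearman rank correlation of per-query entropy between model families on a shared query set. The paper may define the pairing differently:

```python
from scipy.stats import spearmanr

def cross_family_rank_agreement(entropy_family_a, entropy_family_b):
    """Spearman rank correlation of per-query mean entropy between two
    model families evaluated on the same queries."""
    rho, p = spearmanr(entropy_family_a, entropy_family_b)
    return rho, p
```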

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on domain assumptions about text-only observability and reward equivalence; the tensor interface is introduced as a new construct without external falsifiable evidence beyond the reported AUC values.

axioms (2)
  • domain assumption: Supervisor observes only model output text
    Invoked for both impossibility proofs
  • domain assumption: Observations are identical for grounded and fabricated responses
    Core premise that prevents convergence to honest behavior
invented entities (1)
  • tensor interface (no independent evidence)
    purpose: Exporting per-token entropy and log-probability distributions structurally coupled to correctness
    Construct introduced to escape the text-only impossibility

pith-pipeline@v0.9.0 · 5623 in / 1397 out tokens · 42819 ms · 2026-05-15T06:28:39.411287+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.