Truth as a Compression Artifact in Language Model Training
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 12:19 UTC · model grok-4.3
The pith
Language models prefer correct answers only when incorrect ones fail to form a compressible coherent system.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions. When errors are random, models extract the correct signal with accuracy scaling from 65 percent to 85 percent with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (45-51 percent). A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it (47 percent to 78 percent), with continued growth through N=10 (88 percent). The same pattern reproduces on real Wikipedia text (71 percent vs 46 percent). We propose the Compression-Consistency Principle: in these settings, gradient descent favors the most compressible answer cluster, not truth per se.
What carries the argument
The Compression-Consistency Principle, the claim that gradient descent favors the most compressible answer cluster rather than truth itself.
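A minimal sketch of what the principle predicts (our illustration, not the paper's code or method): treat zlib-compressed size as a crude proxy for description length, and note that a single coherent false rule yields the most compressible answer cluster, so it wins over both the correct answers and random errors.

```python
import random
import zlib

def description_length(lines):
    """Crude MDL proxy: zlib-compressed size of the cluster's concatenated text."""
    return len(zlib.compress("\n".join(lines).encode()))

def preferred_cluster(clusters):
    """The principle's prediction: training favors the most compressible cluster."""
    return min(clusters, key=lambda name: description_length(clusters[name]))

rng = random.Random(0)
problems = range(200)
clusters = {
    # Correct answers: one true rule, n -> 2n + 1.
    "correct": [f"f({n})={2 * n + 1}" for n in problems],
    # Coherent errors: a single systematic false rule, here n -> 0.
    "coherent": [f"f({n})=0" for n in problems],
    # Random errors: an arbitrary wrong answer for each problem.
    "random": [f"f({n})={rng.randint(0, 999)}" for n in problems],
}

# The coherent false cluster is maximally repetitive, so it compresses best
# and is "preferred" under this proxy, mirroring the paper's chance-accuracy
# result for a single coherent alternative rule.
print(preferred_cluster(clusters))
```

Under this toy proxy the output is "coherent": the falsehood wins precisely because it forms the most compressible system, which is the pattern the experiments report.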
If this is right
- Random errors allow models to recover the correct answers with accuracy that grows with model size.
- A single coherent false rule system makes models unable to distinguish truth from the false alternative.
- Multiple competing false rule systems largely restore the preference for correct answers.
- The same accuracy gap between random and coherent errors appears when the training text is drawn from Wikipedia.
Where Pith is reading between the lines
- Coherent but false narratives in large-scale data may persist in models precisely because they compress well.
- Scaling alone may not increase truthfulness if training data contains internally consistent falsehoods.
- The principle suggests a way to measure and mitigate truth bias by controlling the structural coherence of errors in synthetic data.
Load-bearing premise
The compressibility preference measured in small-scale experiments on explicit correct-incorrect math pairs will scale to explain truth bias in large natural-language pretraining.
What would settle it
Train a model on a corpus containing one coherent false rule system whose internal compressibility is lower than the true rule; if accuracy on the true answers remains above chance, the principle is falsified.
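One way to construct such a corpus, sketched under our own assumptions (the hash-style rule is illustrative, not the paper's design): make the false rule deterministic and applied consistently, so it is coherent as a rule system, while keeping its outputs pseudorandom-looking so the false answer cluster compresses worse than the true one.

```python
import zlib

def description_length(lines):
    # Crude proxy for a rule system's compressibility.
    return len(zlib.compress("\n".join(lines).encode()))

pairs = [(a, b) for a in range(20) for b in range(20)]

# True rule: plain addition; answers are small and highly patterned.
true_cluster = [f"{a}+{b}={a + b}" for a, b in pairs]

# Coherent-but-hard-to-compress false rule: deterministic and consistent
# across all problems, yet its outputs look pseudorandom (a hash-style mix),
# so the cluster's description length exceeds the truth's.
false_cluster = [f"{a}+{b}={(a * 2654435761 + b * 40503) % 97}" for a, b in pairs]

# The falsification setup requires the false system to be LESS compressible:
assert description_length(false_cluster) > description_length(true_cluster)
```

If a model trained on such a corpus still answered the true rule above chance, that would be the outcome the falsification test asks for.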
Original abstract
Why do language models trained on contradictory data prefer correct answers? In controlled experiments with small transformers (3.5M--86M parameters), we show that this preference tracks the compressibility structure of errors rather than truth per se. We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions -- a denoising design that directly models conflicting information about the same fact. When errors are random, models extract the correct signal with accuracy scaling from 65% to 85% with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (~45--51%): the model cannot distinguish the false system from truth. A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it (47%->78%), with continued growth through N=10 (88%). The same pattern reproduces on real Wikipedia text (71% vs 46%). We propose the Compression--Consistency Principle as an explanatory hypothesis: in these settings, gradient descent favors the most compressible answer cluster, not truth per se. Truth bias emerges only when falsehood is structurally incoherent. Whether this principle extends to large-scale pretraining remains an open question.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports controlled experiments training small GPT-2-style transformers (3.5M–86M parameters) on synthetic mathematical problems and Wikipedia text, each containing both correct and incorrect solutions. Results show accuracy scaling with model size (65–85%) for random errors, dropping to chance (45–51%) for a single coherent alternative rule, and recovering sharply (47%→78% at N=2 rules, up to 88% at N=10) with multiple competing rules. The same pattern holds on real Wikipedia data (71% vs 46%). The authors propose the Compression-Consistency Principle as an interpretive hypothesis: gradient descent favors the most compressible answer cluster, with truth bias emerging only when falsehoods are structurally incoherent. Extension to large-scale pretraining is explicitly left open.
Significance. If the reported patterns hold, the work supplies a mechanistic account of truth bias in language models as a compression artifact of gradient descent rather than an intrinsic preference for accuracy. The controlled manipulation of error coherence and rule multiplicity isolates a clear crossover effect, offering a falsifiable lens on data-quality effects that could guide future scaling studies and data curation. The Wikipedia replication adds ecological relevance within the small-model regime.
major comments (2)
- [Experiments] Experiments section: accuracy values (e.g., 65–85% random-error scaling, 47%→78% crossover at N=2) are reported without error bars, standard deviations across seeds, or statistical significance tests, which is load-bearing for claims of reliable size scaling and sharp rule-count transitions.
- [Methods] Methods: exact training hyperparameters (learning rate, batch size, epochs, optimizer) and precise data-generation procedure for the mathematical problems are omitted, preventing direct replication of the denoising setup that underpins the central empirical patterns.
minor comments (2)
- [Abstract] Abstract: the phrase 'denoising design' is used without a brief inline definition; adding one sentence would clarify the experimental paradigm for readers encountering the work for the first time.
- [Discussion] Discussion: the Compression-Consistency Principle is introduced as a post-hoc interpretive hypothesis; a short formal statement or pseudocode would make its scope and falsifiability conditions more explicit.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive feedback. We address each major comment below and will incorporate the requested details in the revised manuscript.
Point-by-point responses
- Referee: [Experiments] Experiments section: accuracy values (e.g., 65–85% random-error scaling, 47%→78% crossover at N=2) are reported without error bars, standard deviations across seeds, or statistical significance tests, which is load-bearing for claims of reliable size scaling and sharp rule-count transitions.
Authors: We agree that reporting error bars, standard deviations across seeds, and statistical significance tests would strengthen the claims. In the revised manuscript we will rerun key experiments with multiple random seeds (minimum of five per configuration), add error bars to all accuracy plots, report means and standard deviations, and include t-tests or equivalent for the reported scaling trends and the N=2 crossover effect. Revision: yes.
- Referee: [Methods] Methods: exact training hyperparameters (learning rate, batch size, epochs, optimizer) and precise data-generation procedure for the mathematical problems are omitted, preventing direct replication of the denoising setup that underpins the central empirical patterns.
Authors: We acknowledge the omission. The revised manuscript will contain a complete Methods section that specifies all training hyperparameters (learning rate, batch size, number of epochs, optimizer and its settings) together with the exact data-generation procedure for the synthetic mathematical problems, including how correct and incorrect solutions were constructed and paired. Revision: yes.
Circularity Check
No significant circularity
Full rationale
The paper's central claim rests on controlled denoising experiments with small transformers on synthetic math problems and Wikipedia text, where accuracy patterns (scaling with size for random errors, dropping to chance for coherent alternatives, recovering with multiple rules) are reported as direct empirical observations. The Compression-Consistency Principle is introduced explicitly as a post-experimental interpretive hypothesis rather than a quantity derived from fitted parameters, self-referential equations, or prior self-citations. No load-bearing steps reduce by construction to inputs; the derivation chain consists of experimental design, results, and scoped generalization flagged as open. This is self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: gradient descent on the language-modeling objective favors the most compressible consistent answer cluster when presented with conflicting information about the same fact.
invented entities (1)
- Compression-Consistency Principle (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We propose the Compression--Consistency Principle... gradient descent favors the most compressible answer cluster, not truth per se.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] V. Aksenov, E. Bodnia, M. H. Freedman, and M. Mulligan. Compression is all you need: Modeling mathematics. arXiv preprint arXiv:2603.20396.
- [2]
- [3] Y. Elazar, N. Kassner, S. Ravfogel, A. Feder, A. Ravichander, M. Mosbach, Y. Belinkov, H. Schütze, and Y. Goldberg. Measuring causal effects of data statistics on language model's factual predictions. arXiv preprint arXiv:2207.14251.
- [4] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, J. Kaplan, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [5] K. Li, A. K. Hopkins, D. Bau, F. Viegas, H. Pfister, and M. Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In ICLR, 2023a. K. Li, O. Patel, F. Viegas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023b. Z. Liu, Z. Zhong, and M....
- [6]
- [7]
- [8] G. Penedo, H. Kydlíček, A. Lozhkov, M. Mitchell, T. Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.
- [9] D. Rolnick, A. Veit, S. Belongie, and N. Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
- [10]
- [11] Paper excerpt (data-generation example): "... = 6x + 15. Step 2: 6x + 15 - 4x = 2x + 15. Answer: 2x + 15." The corresponding random-error version replaces a derivation step with a plausible but incorrect computation (e.g., 6x + 15 - 4x = 2x + 11). In the coherent condition, a systematic rule is applied consistently across all problems of the same type (e.g., for distribution: a(b+c) = ab + c instead of ab + ...
- [12] Paper excerpt (Appendix G, per-seed details). Table 15: denoising with equal random errors, per-seed accuracy.
  Size   | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Mean
  tiny   | 64.2%  | 64.1%  | 67.0%  | 65.8%  | 65.3%
  small  | 75.1%  | 76.3%  | 72.6%  | 74.2%  | 74.6%
  medium | 79.6%  | 80.9%  | 82.4%  | 81.3%  | 81.1%
  large  | 83.5%  | 86.8%  | –      | –      | 85.2%
  A second per-seed table begins: tiny 40.2% 44.3% 46.3% 43.0% (mean 43.5%), sm...
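From the worked distribution example quoted in entry [11], the paper's three corpus conditions can be sketched as a toy answer generator. The function and condition names below are ours, not the paper's; only the coherent rule a(b+c) = ab + c comes from the excerpt.

```python
import random

def solution_answer(a, b, c, condition="correct", rng=None):
    """Answer to a*(b+c) under the three corpus conditions (illustrative sketch)."""
    truth = a * b + a * c
    if condition == "correct":
        return truth
    if condition == "coherent":
        # One systematic false rule applied to every distribution problem:
        # a(b+c) = ab + c, as in the excerpt above.
        return a * b + c
    if condition == "random":
        # A plausible but unsystematic error: perturb the true answer slightly.
        rng = rng or random.Random()
        return truth + rng.choice([-3, -2, -1, 1, 2, 3])
    raise ValueError(f"unknown condition: {condition}")

# 3*(2+4): the true answer is 18; the coherent false rule gives 3*2 + 4 = 10.
```

Random errors scatter around the truth with no shared structure, while the coherent condition produces a second internally consistent system, which is exactly the contrast the experiments manipulate.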
discussion (0)