Truth as a Compression Artifact in Language Model Training
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 12:19 UTC · model grok-4.3
The pith
Language models prefer correct answers only when incorrect ones fail to form a compressible coherent system.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions. When errors are random, models extract the correct signal with accuracy scaling from 65 percent to 85 percent with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (45-51 percent). A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it (47 percent to 78 percent), with continued growth through N=10 (88 percent). The same pattern reproduces on real Wikipedia text (71 percent vs 46 percent). We propose the Compression-Consistency Principle: in these settings, gradient descent favors the most compressible answer cluster, not truth per se.
What carries the argument
The Compression-Consistency Principle, the claim that gradient descent favors the most compressible answer cluster rather than truth itself.
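A minimal sketch of what the principle predicts (our illustration, not the paper's code or method): treat zlib-compressed size as a crude proxy for description length, and note that a single coherent false rule yields the most compressible answer cluster, so it wins over both the correct answers and random errors.

```python
import random
import zlib

def description_length(lines):
    """Crude MDL proxy: zlib-compressed size of the cluster's concatenated text."""
    return len(zlib.compress("\n".join(lines).encode()))

def preferred_cluster(clusters):
    """The principle's prediction: training favors the most compressible cluster."""
    return min(clusters, key=lambda name: description_length(clusters[name]))

rng = random.Random(0)
problems = range(200)
clusters = {
    # Correct answers: one true rule, n -> 2n + 1.
    "correct": [f"f({n})={2 * n + 1}" for n in problems],
    # Coherent errors: a single systematic false rule, here n -> 0.
    "coherent": [f"f({n})=0" for n in problems],
    # Random errors: an arbitrary wrong answer for each problem.
    "random": [f"f({n})={rng.randint(0, 999)}" for n in problems],
}

# The coherent false cluster is maximally repetitive, so it compresses best
# and is "preferred" under this proxy, mirroring the paper's chance-accuracy
# result for a single coherent alternative rule.
print(preferred_cluster(clusters))
```

Under this toy proxy the output is "coherent": the falsehood wins precisely because it forms the most compressible system, which is the pattern the experiments report.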
If this is right
- Random errors allow models to recover the correct answers with accuracy that grows with model size.
- A single coherent false rule system makes models unable to distinguish truth from the false alternative.
- Multiple competing false rule systems largely restore the preference for correct answers.
- The same accuracy gap between random and coherent errors appears when the training text is drawn from Wikipedia.
Where Pith is reading between the lines
- Coherent but false narratives in large-scale data may persist in models precisely because they compress well.
- Scaling alone may not increase truthfulness if training data contains internally consistent falsehoods.
- The principle suggests a way to measure and mitigate truth bias by controlling the structural coherence of errors in synthetic data.
Load-bearing premise
The compressibility preference measured in small-scale experiments on explicit correct-incorrect math pairs will scale to explain truth bias in large natural-language pretraining.
What would settle it
Train a model on a corpus containing one coherent false rule system whose internal compressibility is lower than the true rule; if accuracy on the true answers remains above chance, the principle is falsified.
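One way to construct such a corpus, sketched under our own assumptions (the hash-style rule is illustrative, not the paper's design): make the false rule deterministic and applied consistently, so it is coherent as a rule system, while keeping its outputs pseudorandom-looking so the false answer cluster compresses worse than the true one.

```python
import zlib

def description_length(lines):
    # Crude proxy for a rule system's compressibility.
    return len(zlib.compress("\n".join(lines).encode()))

pairs = [(a, b) for a in range(20) for b in range(20)]

# True rule: plain addition; answers are small and highly patterned.
true_cluster = [f"{a}+{b}={a + b}" for a, b in pairs]

# Coherent-but-hard-to-compress false rule: deterministic and consistent
# across all problems, yet its outputs look pseudorandom (a hash-style mix),
# so the cluster's description length exceeds the truth's.
false_cluster = [f"{a}+{b}={(a * 2654435761 + b * 40503) % 97}" for a, b in pairs]

# The falsification setup requires the false system to be LESS compressible:
assert description_length(false_cluster) > description_length(true_cluster)
```

If a model trained on such a corpus still answered the true rule above chance, that would be the outcome the falsification test asks for.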
Original abstract
Why do language models trained on contradictory data prefer correct answers? In controlled experiments with small transformers (3.5M--86M parameters), we show that this preference tracks the compressibility structure of errors rather than truth per se. We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions -- a denoising design that directly models conflicting information about the same fact. When errors are random, models extract the correct signal with accuracy scaling from 65% to 85% with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (~45--51%): the model cannot distinguish the false system from truth. A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it (47%->78%), with continued growth through N=10 (88%). The same pattern reproduces on real Wikipedia text (71% vs 46%). We propose the Compression--Consistency Principle as an explanatory hypothesis: in these settings, gradient descent favors the most compressible answer cluster, not truth per se. Truth bias emerges only when falsehood is structurally incoherent. Whether this principle extends to large-scale pretraining remains an open question.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports controlled experiments training small GPT-2-style transformers (3.5M–86M parameters) on synthetic mathematical problems and Wikipedia text, each containing both correct and incorrect solutions. Results show accuracy scaling with model size (65–85%) for random errors, dropping to chance (45–51%) for a single coherent alternative rule, and recovering sharply (47%→78% at N=2 rules, up to 88% at N=10) with multiple competing rules. The same pattern holds on real Wikipedia data (71% vs 46%). The authors propose the Compression-Consistency Principle as an interpretive hypothesis: gradient descent favors the most compressible answer cluster, with truth bias emerging only when falsehoods are structurally incoherent. Extension to large-scale pretraining is explicitly left open.
Significance. If the reported patterns hold, the work supplies a mechanistic account of truth bias in language models as a compression artifact of gradient descent rather than an intrinsic preference for accuracy. The controlled manipulation of error coherence and rule multiplicity isolates a clear crossover effect, offering a falsifiable lens on data-quality effects that could guide future scaling studies and data curation. The Wikipedia replication adds ecological relevance within the small-model regime.
major comments (2)
- [Experiments] Experiments section: accuracy values (e.g., 65–85% random-error scaling, 47%→78% crossover at N=2) are reported without error bars, standard deviations across seeds, or statistical significance tests, which is load-bearing for claims of reliable size scaling and sharp rule-count transitions.
- [Methods] Methods: exact training hyperparameters (learning rate, batch size, epochs, optimizer) and precise data-generation procedure for the mathematical problems are omitted, preventing direct replication of the denoising setup that underpins the central empirical patterns.
minor comments (2)
- [Abstract] Abstract: the phrase 'denoising design' is used without a brief inline definition; adding one sentence would clarify the experimental paradigm for readers encountering the work for the first time.
- [Discussion] Discussion: the Compression-Consistency Principle is introduced as a post-hoc interpretive hypothesis; a short formal statement or pseudocode would make its scope and falsifiability conditions more explicit.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive feedback. We address each major comment below and will incorporate the requested details in the revised manuscript.
Point-by-point responses
- Referee: [Experiments] Experiments section: accuracy values (e.g., 65–85% random-error scaling, 47%→78% crossover at N=2) are reported without error bars, standard deviations across seeds, or statistical significance tests, which is load-bearing for claims of reliable size scaling and sharp rule-count transitions.
Authors: We agree that reporting error bars, standard deviations across seeds, and statistical significance tests would strengthen the claims. In the revised manuscript we will rerun key experiments with multiple random seeds (minimum of five per configuration), add error bars to all accuracy plots, report means and standard deviations, and include t-tests or equivalent for the reported scaling trends and the N=2 crossover effect. Revision: yes.
- Referee: [Methods] Methods: exact training hyperparameters (learning rate, batch size, epochs, optimizer) and precise data-generation procedure for the mathematical problems are omitted, preventing direct replication of the denoising setup that underpins the central empirical patterns.
Authors: We acknowledge the omission. The revised manuscript will contain a complete Methods section that specifies all training hyperparameters (learning rate, batch size, number of epochs, optimizer and its settings) together with the exact data-generation procedure for the synthetic mathematical problems, including how correct and incorrect solutions were constructed and paired. Revision: yes.
Circularity Check
No significant circularity
Full rationale
The paper's central claim rests on controlled denoising experiments with small transformers on synthetic math problems and Wikipedia text, where accuracy patterns (scaling with size for random errors, dropping to chance for coherent alternatives, recovering with multiple rules) are reported as direct empirical observations. The Compression-Consistency Principle is introduced explicitly as a post-experimental interpretive hypothesis rather than a quantity derived from fitted parameters, self-referential equations, or prior self-citations. No load-bearing steps reduce by construction to inputs; the derivation chain consists of experimental design, results, and scoped generalization flagged as open. This is self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: gradient descent on the language-modeling objective favors the most compressible consistent answer cluster when presented with conflicting information about the same fact.
invented entities (1)
- Compression-Consistency Principle (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We propose the Compression--Consistency Principle... gradient descent favors the most compressible answer cluster, not truth per se.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] V. Aksenov, E. Bodnia, M. H. Freedman, and M. Mulligan. Compression is all you need: Modeling mathematics. arXiv preprint arXiv:2603.20396.
- [2]
- [3] Y. Elazar, N. Kassner, S. Ravfogel, A. Feder, A. Ravichander, M. Mosbach, Y. Belinkov, H. Schütze, and Y. Goldberg. Measuring causal effects of data statistics on language model's factual predictions. arXiv preprint arXiv:2207.14251.
- [4] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, J. Kaplan, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [5] K. Li, A. K. Hopkins, D. Bau, F. Viegas, H. Pfister, and M. Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In ICLR, 2023a. K. Li, O. Patel, F. Viegas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023b. Z. Liu, Z. Zhong, and M....
- [6]
- [7]
- [8] G. Penedo, H. Kydlíček, A. Lozhkov, M. Mitchell, T. Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.
- [9] D. Rolnick, A. Veit, S. Belongie, and N. Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
- [10]
- [11] Paper excerpt (data-generation example): "... = 6x + 15. Step 2: 6x + 15 - 4x = 2x + 15. Answer: 2x + 15." The corresponding random-error version replaces a derivation step with a plausible but incorrect computation (e.g., 6x + 15 - 4x = 2x + 11). In the coherent condition, a systematic rule is applied consistently across all problems of the same type (e.g., for distribution: a(b+c) = ab + c instead of ab + ...
- [12] Paper excerpt (Appendix G, per-seed details). Table 15: denoising with equal random errors, per-seed accuracy.
  Size   | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Mean
  tiny   | 64.2%  | 64.1%  | 67.0%  | 65.8%  | 65.3%
  small  | 75.1%  | 76.3%  | 72.6%  | 74.2%  | 74.6%
  medium | 79.6%  | 80.9%  | 82.4%  | 81.3%  | 81.1%
  large  | 83.5%  | 86.8%  | –      | –      | 85.2%
  A second per-seed table begins: tiny 40.2% 44.3% 46.3% 43.0% (mean 43.5%), sm...
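From the worked distribution example quoted in entry [11], the paper's three corpus conditions can be sketched as a toy answer generator. The function and condition names below are ours, not the paper's; only the coherent rule a(b+c) = ab + c comes from the excerpt.

```python
import random

def solution_answer(a, b, c, condition="correct", rng=None):
    """Answer to a*(b+c) under the three corpus conditions (illustrative sketch)."""
    truth = a * b + a * c
    if condition == "correct":
        return truth
    if condition == "coherent":
        # One systematic false rule applied to every distribution problem:
        # a(b+c) = ab + c, as in the excerpt above.
        return a * b + c
    if condition == "random":
        # A plausible but unsystematic error: perturb the true answer slightly.
        rng = rng or random.Random()
        return truth + rng.choice([-3, -2, -1, 1, 2, 3])
    raise ValueError(f"unknown condition: {condition}")

# 3*(2+4): the true answer is 18; the coherent false rule gives 3*2 + 4 = 10.
```

Random errors scatter around the truth with no shared structure, while the coherent condition produces a second internally consistent system, which is exactly the contrast the experiments manipulate.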
discussion (0)