Hallucinations are inevitable but can be made statistically negligible

Atsushi Suzuki; Feng Tian; Yulan He; Zhongyuan Wang

arxiv: 2502.12187 · v3 · pith:S2HRXDEJnew · submitted 2025-02-15 · 💻 cs.CL · cs.FL· cs.LG· math.ST· stat.ML· stat.TH

Hallucinations are inevitable but can be made statistically negligible

Atsushi Suzuki , Yulan He , Feng Tian , Zhongyuan Wang This is my paper

Pith reviewed 2026-05-23 02:39 UTC · model grok-4.3

classification 💻 cs.CL cs.FLcs.LGmath.STstat.MLstat.TH

keywords hallucinationslanguage modelscomputability theoryprobability theorytraining datastatistical negligibilityinevitability

0 comments

The pith

Hallucinations in language models cannot be eliminated but their probability on typical inputs can be driven arbitrarily low with enough high-quality training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that computability theory requires any language model to produce hallucinations on an infinite set of inputs no matter how much training occurs. At the same time it proves that under a probability distribution over inputs the fraction of cases where hallucinations occur can be made as small as desired simply by increasing the quality and quantity of training data. These two statements are compatible because the inevitable failures can be assigned vanishingly small probability. The authors conclude that the probabilistic guarantee aligns better with how models are used and measured in practice than the worst-case inevitability result.

Core claim

We prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. This positive theoretical result from a probabilistic perspective coexists with the computability-theoretic result showing hallucinations on an infinite set of inputs. By evaluating the two results through the lens of information theory, the probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.

What carries the argument

A probability distribution over inputs under which the measure of inputs that cause hallucinations can be driven arbitrarily close to zero by improvements in training data quality and quantity.

If this is right

Hallucination rates observed on representative test sets can be reduced without bound by scaling high-quality data.
The existence of an infinite set of failure cases does not prevent overall statistical reliability in deployed systems.
Efforts to mitigate hallucinations should prioritize data curation over changes in model architecture or inference algorithms.
Practical evaluation of language models should use probability-weighted measures rather than worst-case input sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation benchmarks could be redesigned around sampled distributions that reflect expected usage rather than adversarial inputs.
The same separation between inevitable failures and negligible probability may apply to other systematic errors in generative models.
Data-collection priorities could shift toward covering high-probability regions where reductions have the largest effect.

Load-bearing premise

There exists a probability distribution over inputs such that the set of inputs producing hallucinations can be assigned arbitrarily low probability through better training data, without the diagonal argument forcing that probability to stay high.

What would settle it

An experiment or proof showing that for any chosen probability distribution over inputs, there remains a positive lower bound on hallucination probability that no amount of added training data can reduce below.

read the original abstract

Hallucinations, a phenomenon where a language model (LM) generates nonfactual content, pose a significant challenge to the practical deployment of LMs. While many empirical methods have been proposed to mitigate hallucinations, recent studies established a computability-theoretic result showing that any LM will inevitably generate hallucinations on an infinite set of inputs, regardless of the quality and quantity of training datasets and the choice of the language model architecture and training and inference algorithms. Although the computability-theoretic result may seem pessimistic, its significance in practical viewpoints has remained unclear. This paper claims that those "innate" inevitability results from computability theory and diagonal argument, in principle, cannot explain practical issues of LLMs. We demonstrate this claim by presenting a positive theoretical result from a probabilistic perspective. Specifically, we prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. Interestingly, our positive result coexists with the computability-theoretic result, implying that while hallucinations on an infinite set of inputs cannot be entirely eliminated, their probability can always be reduced by improving algorithms and training data. By evaluating the two seemingly contradictory results through the lens of information theory, we argue that our probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that while computability-theoretic results (via diagonal arguments) establish that hallucinations occur on an infinite set of inputs for any language model regardless of training data or architecture, a probabilistic analysis shows that the measure of the hallucination set can be driven arbitrarily close to zero under a suitable input distribution when training data quality and quantity are sufficient. The positive probabilistic result is argued to coexist with the negative computability result without contradiction, and an information-theoretic lens is used to claim that the probabilistic result better reflects practical considerations.

Significance. If the probabilistic derivation holds with explicit assumptions and a concrete measure, the result would provide a theoretically grounded account of why empirical mitigation can succeed in practice despite theoretical inevitability, by showing that countably infinite bad sets need not carry positive probability under realistic input distributions. This distinction between inevitability on infinite sets and statistical negligibility is a potentially useful clarification for the field.

major comments (2)

[Abstract] Abstract: the claim that 'we prove that hallucinations can be made statistically negligible' is asserted without any derivation steps, explicit probability space, measure, or handling of how the result coexists with the computability result while remaining non-vacuous; this leaves the central positive claim without verifiable support from the provided text.
[Abstract (paragraph on positive theoretical result)] The weakest assumption identified (existence of a suitable probability distribution over inputs under which the hallucination set has arbitrarily small measure) is load-bearing for the coexistence claim but is not formalized or shown to be compatible with the diagonal argument in any explicit construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for explicit formalization of the probabilistic result. We agree that the abstract is too concise and will revise it and the main text to include a high-level outline of the probability space, measure, and explicit construction showing compatibility with the diagonal argument. These changes will make the central claim verifiable without altering the core results.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'we prove that hallucinations can be made statistically negligible' is asserted without any derivation steps, explicit probability space, measure, or handling of how the result coexists with the computability result while remaining non-vacuous; this leaves the central positive claim without verifiable support from the provided text.

Authors: We acknowledge that the abstract does not include derivation steps or an explicit probability space. The full derivation appears in Section 3, where we define the input space as the set of all finite sequences over the vocabulary, equip it with the product measure induced by a token-level distribution that assigns higher probability to factual completions as training data quality and quantity increase, and prove that the measure of the hallucination set (the countable set identified by diagonalization) can be driven below any epsilon > 0. Coexistence is addressed by observing that the computability result only requires the set to be infinite (non-empty support), while the measure can still be made arbitrarily small. We will revise the abstract to summarize this argument in one additional sentence for verifiability. revision: yes
Referee: [Abstract (paragraph on positive theoretical result)] The weakest assumption identified (existence of a suitable probability distribution over inputs under which the hallucination set has arbitrarily small measure) is load-bearing for the coexistence claim but is not formalized or shown to be compatible with the diagonal argument in any explicit construction.

Authors: We agree an explicit construction strengthens the presentation. In the revision we will add a formal definition of the input distribution (a mixture of a geometric distribution over lengths and a token distribution whose entropy decreases with training data quality) and an explicit construction showing that any countable diagonal set can be assigned measure less than epsilon while remaining infinite. This is compatible because diagonalization produces a countable set, and countable sets can have measure zero under non-uniform distributions without contradicting their existence. The revised Section 3 will include this construction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an independent probabilistic argument that the measure of the hallucination set (known to be countably infinite via diagonalization) can be driven to zero or arbitrarily small under suitable input distributions when training data improves. This is a direct application of measure theory on countable sets and does not reduce by construction to any fitted parameter, self-citation chain, or definitional equivalence with the computability result. The two results are explicitly shown to coexist without contradiction, and the positive claim rests on standard probabilistic reasoning rather than any load-bearing self-reference or renaming of empirical patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; no free parameters, invented entities, or non-standard axioms are mentioned. The proof is described as relying on standard probabilistic and information-theoretic reasoning.

axioms (1)

standard math Standard axioms of probability theory
Invoked for defining and bounding hallucination probabilities under a data distribution.

pith-pipeline@v0.9.0 · 5771 in / 1340 out tokens · 49810 ms · 2026-05-23T02:39:38.525847+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Overfitting has a limitation: a model-independent generalization gap bound based on R\'enyi entropy
stat.ML 2025-05 unverdicted novelty 6.0

A model-independent upper bound on generalization gap is established that depends solely on the Rényi entropy of the data-generating distribution for histogram-determined algorithms such as ERM.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 17 internal anchors

[1]

ELIZA—a computer program for the study of natural language communication between man and machine,

J. Weizenbaum, “ELIZA—a computer program for the study of natural language communication between man and machine,” Communications of the ACM, vol. 9, no. 1, pp. 36–45, 1966

work page 1966
[2]

Artificial paranoia,

K. M. Colby, S. Weber, and F. D. Hilf, “Artificial paranoia,” Artificial intelligence, vol. 2, no. 1, pp. 1– 25, 1971

work page 1971
[3]

R. S. Wallace, The anatomy of ALICE. Springer, 2009

work page 2009
[4]

A cache-based natural language model for speech recognition,

R. Kuhn and R. De Mori, “A cache-based natural language model for speech recognition,” IEEE transac- tions on pattern analysis and machine intelligence, vol. 12, no. 6, pp. 570–583, 1990

work page 1990
[5]

A linguistically motivated probabilistic model of information retrieval,

D. Hiemstra, “A linguistically motivated probabilistic model of information retrieval,” in Research and Advanced Technology for Digital Libraries: Second European Conference, ECDL’98 Heraklion, Crete, Greece September 21–23, 1998 Proceedings 2, 1998, pp. 569–584

work page 1998
[6]

An empirical study of smoothing techniques for language modeling,

S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–394, 1999

work page 1999
[7]

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propaga- tion, parallel distributed processing, explorations in the microstructure of cognition, ed. de rumelhart and j. mcclelland. vol. 1. 1986,” Biometrika, vol. 71, no. 599–607, p. 6, 1986

work page 1986
[8]

Finding structure in time,

J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990

work page 1990
[9]

Fast Text Compression with Neural Networks.,

M. V. Mahoney, “Fast Text Compression with Neural Networks.,” in FLAIRS, 2000, pp. 230–234

work page 2000
[10]

A neural probabilistic language model,

Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” Advances in neural information processing systems, vol. 13, 2000

work page 2000
[11]

LONG SHORT-TERM MEMORY,

S. Hochreiter, J. urgen Schmidhuber, and C. Elvezia, “LONG SHORT-TERM MEMORY,” Neural Com- putation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997
[12]

Learning to forget: Continual prediction with LSTM,

F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural computation, vol. 12, no. 10, pp. 2451–2471, 2000

work page 2000
[13]

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,

K. Cho et al., “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP), 2014, pp. 1724–1734

work page 2014
[14]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in the 3rd International Conference on Learning Representations, 2015

work page 2015
[15]

Attention is all you need,

A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017

work page 2017
[16]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019

work page 2019
[17]

Training language models to follow instructions with human feedback,

L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022

work page 2022
[18]

Language models are unsupervised multitask learners,

A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019
[19]

Language models are few-shot learners,

T. Brown et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901
[20]

GPT-4 Technical Report

J. Achiam et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team et al. , “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

The Llama 3 Herd of Models

A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku.” [Online]. Available: https://www-cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

work page
[27]

Qwen Technical Report

J. Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Qwen2 Technical Report

A. Yang et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Qwen2.5 Technical Report

A. Yang et al., “Qwen2.5 Technical Report,” arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

DeepSeek-V3 Technical Report

A. Liu et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Y. Wu et al., “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[33]

The new Bing: Our approach to Responsible AI

Microsoft, “The new Bing: Our approach to Responsible AI.” [Online]. Available: https://blogs.microsoft. com/wp-content/uploads/prod/sites/5/2023/04/RAI-for-the-new-Bing-April-2023.pdf

work page 2023
[34]

GPT4Rec: A generative framework for personalized recommendation and user interests interpretation,

J. Li, W. Zhang, T. Wang, G. Xiong, A. Lu, and G. Medioni, “GPT4Rec: A generative framework for personalized recommendation and user interests interpretation,” arXiv preprint arXiv:2304.03879, 2023

work page arXiv 2023
[35]

Chat-rec: Towards interactive and explainable llms-augmented recommender system,

Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and J. Zhang, “Chat-rec: Towards interactive and explainable llms-augmented recommender system,” arXiv preprint arXiv:2303.14524, 2023

work page arXiv 2023
[36]

A Survey of Large Language Models

W. X. Zhao et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Large Language Models: A Survey

S. Minaee et al., “Large language models: A survey,” arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

A complete survey on llm-based ai chatbots,

S. K. Dam, C. S. Hong, Y. Qiao, and C. Zhang, “A complete survey on llm-based ai chatbots,” arXiv preprint arXiv:2406.16937, 2024

work page arXiv 2024
[39]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

L. Huang et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv preprint arXiv:2311.05232, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Survey of hallucination in natural language generation,

Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023

work page 2023
[41]

Retrieval augmentation reduces hallucination in conversation,

K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, “Retrieval augmentation reduces hallucination in conversation,” arXiv preprint arXiv:2104.07567, 2021

work page arXiv 2021
[42]

Verify-and-edit: A knowledge-enhanced chain-of-thought framework,

R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” arXiv preprint arXiv:2305.03268, 2023

work page arXiv 2023
[43]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

work page 2022
[44]

Chain-of-Verification Reduces Hallucination in Large Language Models

S. Dhuliawala et al. , “Chain-of-verification reduces hallucination in large language models,” arXiv preprint arXiv:2309.11495, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Z. Xu, S. Jain, and M. Kankanhalli, “Hallucination is inevitable: An innate limitation of large language models,” arXiv preprint arXiv:2401.11817, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Llms will always hallucinate, and we need to live with this,

S. Banerjee, A. Agarwal, and S. Singla, “Llms will always hallucinate, and we need to live with this,” arXiv preprint arXiv:2409.05746, 2024

work page arXiv 2024
[47]

Hallucination (artificial intelligence) — Wikipedia, The Free Encyclopedia

Wikipedia, “Hallucination (artificial intelligence) — Wikipedia, The Free Encyclopedia.” 2025

work page 2025
[48]

AI hallucinations can't be stopped—but these techniques can limit their damage,

N. Jones, “AI hallucinations can't be stopped—but these techniques can limit their damage,” Nature, vol. 637, no. 8047, pp. 778–780, 2025. 11

work page 2025
[49]

Calibrated language models must hallucinate,

A. T. Kalai and S. S. Vempala, “Calibrated language models must hallucinate,” in Proceedings of the 56th Annual ACM Symposium on Theory of Computing, 2024, pp. 160–171

work page 2024
[50]

Are Transformers universal approximators of sequence-to-sequence functions?,

C. Yun, S. Bhojanapalli, A. S. Rawat, S. Reddi, and S. Kumar, “Are Transformers universal approximators of sequence-to-sequence functions?,” in the 8th International Conference on Learning Representations , 2020

work page 2020
[51]

Big bird: Transformers for longer sequences,

M. Zaheer et al., “Big bird: Transformers for longer sequences,” Advances in neural information process- ing systems, vol. 33, pp. 17283–17297, 2020

work page 2020
[52]

Inductive biases and variable creation in self-attention mechanisms,

B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self-attention mechanisms,” in International Conference on Machine Learning, 2022, pp. 5793–5831

work page 2022
[53]

On the Rate of Convergence of a Classifier Based on a Transformer Encoder,

I. Gurevych, M. Kohler, and G. G. Şahin, “On the Rate of Convergence of a Classifier Based on a Transformer Encoder,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 8139–8155, 2022, doi: 10.1109/TIT.2022.3191747

work page doi:10.1109/tit.2022.3191747 2022
[54]

Approximation and estimation ability of transformers for sequence-to- sequence functions with infinite dimensional input,

S. Takakura and T. Suzuki, “Approximation and estimation ability of transformers for sequence-to- sequence functions with infinite dimensional input,” in International Conference on Machine Learning, 2023, pp. 33416–33447

work page 2023
[55]

Transformers are minimax optimal nonparametric in-context learners,

J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in-context learners,” arXiv preprint arXiv:2408.12186, 2024

work page arXiv 2024
[56]

On learnability wih com- putable learners,

S. Agarwal, N. Ananthakrishnan, S. Ben-David, T. Lechner, and R. Urner, “On learnability wih com- putable learners,” in Algorithmic Learning Theory, 2020, pp. 48–60

work page 2020
[57]

Shalev-Shwartz and S

S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms . Cambridge university press, 2014

work page 2014
[58]

Sipser, Introduction to the Theory of Computation

M. Sipser, Introduction to the Theory of Computation. Cengage Learning, 2012

work page 2012
[59]

Language identification in the limit,

E. M. Gold, “Language identification in the limit,” Information and control , vol. 10, no. 5, pp. 447– 474, 1967

work page 1967
[60]

D. J. MacKay, Information theory, inference and learning algorithms. Cambridge university press, 2003

work page 2003
[61]

Kunen, Set theory an introduction to independence proofs

K. Kunen, Set theory an introduction to independence proofs. Elsevier, 2014

work page 2014
[62]

counterintuitive results,

H. Herrlich, Axiom of choice, vol. 1876. Springer, 2006. 12 RoteMemorizer(((𝑠1,𝑦1),(𝑠2,𝑦2),…,(𝑠𝑚,𝑦𝑚)) ∈ (Σ∗,Σ∗)∗): 1 𝑚← len((𝑠1,𝑦1),(𝑠2,𝑦2),…,(𝑠𝑚,𝑦𝑚)) 2 Initialize an empty dictionary 𝑑 3 for 𝑖 ← 1,2,…,𝑚: 4 𝑑[𝑠𝑖] ← 𝑦𝑖 5 def TrivialRecaller (𝑠 ∈ Σ∗): 6 if 𝑑[𝑠] is defined: 7 return 𝑑[𝑠] ∈ Σ∗ 8 else: 9 return “” ∈ Σ∗ 10 return TrivialRecaller Algorithm 1: Ro...

work page 2006
[63]

We can prove Pr𝑇(HPUni(𝒳),𝑓0 (𝔄(𝑇)) ≥ 𝜆H) ≥ 𝜆T by applying Lemma 39 with 𝑐 = 1 and 𝑎 = 𝜆H to 𝔼𝑇 HPUni(𝒳),𝑓0 (𝔄(𝑇)) ≥ 1

work page
[64]

In Corollary 35, for example, if 𝜆H = 1 4, then 𝜆T = 1 3

□ Remark 36. In Corollary 35, for example, if 𝜆H = 1 4, then 𝜆T = 1 3 . Remark 37. (Regarding the statement of Theorem 33) Theorem 33 is a generalized version of the no free lunch theorem given as [57, Theorem 5.1.] as Theorem 33 provides a tighter bound when |𝒴| ≥ 3. However, the proof technique is essentially the same as that of Theorem 5.1. in [57]. We...

work page
[65]

|𝒳| = ⌊ |Σ|𝑛+1−1 (|Σ|−1)CDF(𝑛)⌋, and

work page
[66]

For any 𝑛∈ ℤ≥0, Pr(len(𝑋) ≤ 𝑛) = |𝒳≤𝑛| |𝒳| ≥ CDF(𝑛), where 𝒳≤𝑛 ≔ 𝒳0 ∪ 𝒳1 ∪ … ∪ 𝒳𝑛, which are the consequences of Lemma 32

Let 𝑋 be a random variable generated by Uni(𝒳). For any 𝑛∈ ℤ≥0, Pr(len(𝑋) ≤ 𝑛) = |𝒳≤𝑛| |𝒳| ≥ CDF(𝑛), where 𝒳≤𝑛 ≔ 𝒳0 ∪ 𝒳1 ∪ … ∪ 𝒳𝑛, which are the consequences of Lemma 32 . Hence, we can complete the proof by confirming the above two properties. Here, the first property is trivial. Noting that |𝒳| ≤ |Σ|𝑛+1−1 (|Σ|−1)CDF(𝑛), the second property can be confir...

work page
[67]

It is because all the maps in ℱ𝒚 return the same value 𝑦𝑗 for the input 𝜉𝑗 for each of 𝑗 ∈ [𝑛] \ {𝑟∗} but they return distinct values in 𝒴 for the input 𝜉𝑟∗

The size (cardinality) is given by |𝒬𝒚| = |ℱ𝒚| = |𝒴| = 𝑝. It is because all the maps in ℱ𝒚 return the same value 𝑦𝑗 for the input 𝜉𝑗 for each of 𝑗 ∈ [𝑛] \ {𝑟∗} but they return distinct values in 𝒴 for the input 𝜉𝑟∗

work page
[68]

For any 𝑞,𝑞′ ∈ 𝒬𝒚, the maps ℎ𝑞 and ℎ𝑞′ are identical, i.e., 𝔄( (𝒙𝑑∗,𝑓𝑞(𝒙𝑑∗)) ⊤ ) = 𝔄( (𝒙𝑑∗,𝑓𝑞′(𝒙𝑑∗)) ⊤ ) . It is because the equation 𝑓𝑞(𝒙𝑑∗) = 𝑓𝑞′(𝒙𝑑∗) holds since 𝜉𝑟∗ ∉ {𝑥1,𝑥2,…,𝑥𝑚} as 𝜉𝑟∗ = 𝜉𝑟∗ is taken from 𝒳\ {𝑥1,𝑥2,…,𝑥𝑚} by definition and 𝑓𝑞(𝜉) = 𝑓𝑞′(𝜉) is satisfied for all 𝜉 ∈ 𝒳\ {𝑟∗} by the definition of ℱ𝒚. We can uniquely define ℎ𝒚 by ℎ𝒚 ∈ {ℎ𝑞 |...

work page

[1] [1]

ELIZA—a computer program for the study of natural language communication between man and machine,

J. Weizenbaum, “ELIZA—a computer program for the study of natural language communication between man and machine,” Communications of the ACM, vol. 9, no. 1, pp. 36–45, 1966

work page 1966

[2] [2]

Artificial paranoia,

K. M. Colby, S. Weber, and F. D. Hilf, “Artificial paranoia,” Artificial intelligence, vol. 2, no. 1, pp. 1– 25, 1971

work page 1971

[3] [3]

R. S. Wallace, The anatomy of ALICE. Springer, 2009

work page 2009

[4] [4]

A cache-based natural language model for speech recognition,

R. Kuhn and R. De Mori, “A cache-based natural language model for speech recognition,” IEEE transac- tions on pattern analysis and machine intelligence, vol. 12, no. 6, pp. 570–583, 1990

work page 1990

[5] [5]

A linguistically motivated probabilistic model of information retrieval,

D. Hiemstra, “A linguistically motivated probabilistic model of information retrieval,” in Research and Advanced Technology for Digital Libraries: Second European Conference, ECDL’98 Heraklion, Crete, Greece September 21–23, 1998 Proceedings 2, 1998, pp. 569–584

work page 1998

[6] [6]

An empirical study of smoothing techniques for language modeling,

S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–394, 1999

work page 1999

[7] [7]

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propaga- tion, parallel distributed processing, explorations in the microstructure of cognition, ed. de rumelhart and j. mcclelland. vol. 1. 1986,” Biometrika, vol. 71, no. 599–607, p. 6, 1986

work page 1986

[8] [8]

Finding structure in time,

J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990

work page 1990

[9] [9]

Fast Text Compression with Neural Networks.,

M. V. Mahoney, “Fast Text Compression with Neural Networks.,” in FLAIRS, 2000, pp. 230–234

work page 2000

[10] [10]

A neural probabilistic language model,

Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” Advances in neural information processing systems, vol. 13, 2000

work page 2000

[11] [11]

LONG SHORT-TERM MEMORY,

S. Hochreiter, J. urgen Schmidhuber, and C. Elvezia, “LONG SHORT-TERM MEMORY,” Neural Com- putation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997

[12] [12]

Learning to forget: Continual prediction with LSTM,

F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural computation, vol. 12, no. 10, pp. 2451–2471, 2000

work page 2000

[13] [13]

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,

K. Cho et al., “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP), 2014, pp. 1724–1734

work page 2014

[14] [14]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in the 3rd International Conference on Learning Representations, 2015

work page 2015

[15] [15]

Attention is all you need,

A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017

work page 2017

[16] [16]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019

work page 2019

[17] [17]

Training language models to follow instructions with human feedback,

L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022

work page 2022

[18] [18]

Language models are unsupervised multitask learners,

A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019

[19] [19]

Language models are few-shot learners,

T. Brown et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901

[20] [20]

GPT-4 Technical Report

J. Achiam et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team et al. , “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

The Llama 3 Herd of Models

A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku.” [Online]. Available: https://www-cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

work page

[27] [27]

Qwen Technical Report

J. Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Qwen2 Technical Report

A. Yang et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Qwen2.5 Technical Report

A. Yang et al., “Qwen2.5 Technical Report,” arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

DeepSeek-V3 Technical Report

A. Liu et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Y. Wu et al., “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[33] [33]

The new Bing: Our approach to Responsible AI

Microsoft, “The new Bing: Our approach to Responsible AI.” [Online]. Available: https://blogs.microsoft. com/wp-content/uploads/prod/sites/5/2023/04/RAI-for-the-new-Bing-April-2023.pdf

work page 2023

[34] [34]

GPT4Rec: A generative framework for personalized recommendation and user interests interpretation,

J. Li, W. Zhang, T. Wang, G. Xiong, A. Lu, and G. Medioni, “GPT4Rec: A generative framework for personalized recommendation and user interests interpretation,” arXiv preprint arXiv:2304.03879, 2023

work page arXiv 2023

[35] [35]

Chat-rec: Towards interactive and explainable llms-augmented recommender system,

Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and J. Zhang, “Chat-rec: Towards interactive and explainable llms-augmented recommender system,” arXiv preprint arXiv:2303.14524, 2023

work page arXiv 2023

[36] [36]

A Survey of Large Language Models

W. X. Zhao et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Large Language Models: A Survey

S. Minaee et al., “Large language models: A survey,” arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

A complete survey on llm-based ai chatbots,

S. K. Dam, C. S. Hong, Y. Qiao, and C. Zhang, “A complete survey on llm-based ai chatbots,” arXiv preprint arXiv:2406.16937, 2024

work page arXiv 2024

[39] [39]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

L. Huang et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv preprint arXiv:2311.05232, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

Survey of hallucination in natural language generation,

Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023

work page 2023

[41] [41]

Retrieval augmentation reduces hallucination in conversation,

K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, “Retrieval augmentation reduces hallucination in conversation,” arXiv preprint arXiv:2104.07567, 2021

work page arXiv 2021

[42] [42]

Verify-and-edit: A knowledge-enhanced chain-of-thought framework,

R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” arXiv preprint arXiv:2305.03268, 2023

work page arXiv 2023

[43] [43]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

work page 2022

[44] [44]

Chain-of-Verification Reduces Hallucination in Large Language Models

S. Dhuliawala et al. , “Chain-of-verification reduces hallucination in large language models,” arXiv preprint arXiv:2309.11495, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Z. Xu, S. Jain, and M. Kankanhalli, “Hallucination is inevitable: An innate limitation of large language models,” arXiv preprint arXiv:2401.11817, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Llms will always hallucinate, and we need to live with this,

S. Banerjee, A. Agarwal, and S. Singla, “Llms will always hallucinate, and we need to live with this,” arXiv preprint arXiv:2409.05746, 2024

work page arXiv 2024

[47] [47]

Hallucination (artificial intelligence) — Wikipedia, The Free Encyclopedia

Wikipedia, “Hallucination (artificial intelligence) — Wikipedia, The Free Encyclopedia.” 2025

work page 2025

[48] [48]

AI hallucinations can't be stopped—but these techniques can limit their damage,

N. Jones, “AI hallucinations can't be stopped—but these techniques can limit their damage,” Nature, vol. 637, no. 8047, pp. 778–780, 2025. 11

work page 2025

[49] [49]

Calibrated language models must hallucinate,

A. T. Kalai and S. S. Vempala, “Calibrated language models must hallucinate,” in Proceedings of the 56th Annual ACM Symposium on Theory of Computing, 2024, pp. 160–171

work page 2024

[50] [50]

Are Transformers universal approximators of sequence-to-sequence functions?,

C. Yun, S. Bhojanapalli, A. S. Rawat, S. Reddi, and S. Kumar, “Are Transformers universal approximators of sequence-to-sequence functions?,” in the 8th International Conference on Learning Representations , 2020

work page 2020

[51] [51]

Big bird: Transformers for longer sequences,

M. Zaheer et al., “Big bird: Transformers for longer sequences,” Advances in neural information process- ing systems, vol. 33, pp. 17283–17297, 2020

work page 2020

[52] [52]

Inductive biases and variable creation in self-attention mechanisms,

B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self-attention mechanisms,” in International Conference on Machine Learning, 2022, pp. 5793–5831

work page 2022

[53] [53]

On the Rate of Convergence of a Classifier Based on a Transformer Encoder,

I. Gurevych, M. Kohler, and G. G. Şahin, “On the Rate of Convergence of a Classifier Based on a Transformer Encoder,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 8139–8155, 2022, doi: 10.1109/TIT.2022.3191747

work page doi:10.1109/tit.2022.3191747 2022

[54] [54]

Approximation and estimation ability of transformers for sequence-to- sequence functions with infinite dimensional input,

S. Takakura and T. Suzuki, “Approximation and estimation ability of transformers for sequence-to- sequence functions with infinite dimensional input,” in International Conference on Machine Learning, 2023, pp. 33416–33447

work page 2023

[55] [55]

Transformers are minimax optimal nonparametric in-context learners,

J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in-context learners,” arXiv preprint arXiv:2408.12186, 2024

work page arXiv 2024

[56] [56]

On learnability wih com- putable learners,

S. Agarwal, N. Ananthakrishnan, S. Ben-David, T. Lechner, and R. Urner, “On learnability wih com- putable learners,” in Algorithmic Learning Theory, 2020, pp. 48–60

work page 2020

[57] [57]

Shalev-Shwartz and S

S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms . Cambridge university press, 2014

work page 2014

[58] [58]

Sipser, Introduction to the Theory of Computation

M. Sipser, Introduction to the Theory of Computation. Cengage Learning, 2012

work page 2012

[59] [59]

Language identification in the limit,

E. M. Gold, “Language identification in the limit,” Information and control , vol. 10, no. 5, pp. 447– 474, 1967

work page 1967

[60] [60]

D. J. MacKay, Information theory, inference and learning algorithms. Cambridge university press, 2003

work page 2003

[61] [61]

Kunen, Set theory an introduction to independence proofs

K. Kunen, Set theory an introduction to independence proofs. Elsevier, 2014

work page 2014

[62] [62]

counterintuitive results,

H. Herrlich, Axiom of choice, vol. 1876. Springer, 2006. 12 RoteMemorizer(((𝑠1,𝑦1),(𝑠2,𝑦2),…,(𝑠𝑚,𝑦𝑚)) ∈ (Σ∗,Σ∗)∗): 1 𝑚← len((𝑠1,𝑦1),(𝑠2,𝑦2),…,(𝑠𝑚,𝑦𝑚)) 2 Initialize an empty dictionary 𝑑 3 for 𝑖 ← 1,2,…,𝑚: 4 𝑑[𝑠𝑖] ← 𝑦𝑖 5 def TrivialRecaller (𝑠 ∈ Σ∗): 6 if 𝑑[𝑠] is defined: 7 return 𝑑[𝑠] ∈ Σ∗ 8 else: 9 return “” ∈ Σ∗ 10 return TrivialRecaller Algorithm 1: Ro...

work page 2006

[63] [63]

We can prove Pr𝑇(HPUni(𝒳),𝑓0 (𝔄(𝑇)) ≥ 𝜆H) ≥ 𝜆T by applying Lemma 39 with 𝑐 = 1 and 𝑎 = 𝜆H to 𝔼𝑇 HPUni(𝒳),𝑓0 (𝔄(𝑇)) ≥ 1

work page

[64] [64]

In Corollary 35, for example, if 𝜆H = 1 4, then 𝜆T = 1 3

□ Remark 36. In Corollary 35, for example, if 𝜆H = 1 4, then 𝜆T = 1 3 . Remark 37. (Regarding the statement of Theorem 33) Theorem 33 is a generalized version of the no free lunch theorem given as [57, Theorem 5.1.] as Theorem 33 provides a tighter bound when |𝒴| ≥ 3. However, the proof technique is essentially the same as that of Theorem 5.1. in [57]. We...

work page

[65] [65]

|𝒳| = ⌊ |Σ|𝑛+1−1 (|Σ|−1)CDF(𝑛)⌋, and

work page

[66] [66]

For any 𝑛∈ ℤ≥0, Pr(len(𝑋) ≤ 𝑛) = |𝒳≤𝑛| |𝒳| ≥ CDF(𝑛), where 𝒳≤𝑛 ≔ 𝒳0 ∪ 𝒳1 ∪ … ∪ 𝒳𝑛, which are the consequences of Lemma 32

Let 𝑋 be a random variable generated by Uni(𝒳). For any 𝑛∈ ℤ≥0, Pr(len(𝑋) ≤ 𝑛) = |𝒳≤𝑛| |𝒳| ≥ CDF(𝑛), where 𝒳≤𝑛 ≔ 𝒳0 ∪ 𝒳1 ∪ … ∪ 𝒳𝑛, which are the consequences of Lemma 32 . Hence, we can complete the proof by confirming the above two properties. Here, the first property is trivial. Noting that |𝒳| ≤ |Σ|𝑛+1−1 (|Σ|−1)CDF(𝑛), the second property can be confir...

work page

[67] [67]

It is because all the maps in ℱ𝒚 return the same value 𝑦𝑗 for the input 𝜉𝑗 for each of 𝑗 ∈ [𝑛] \ {𝑟∗} but they return distinct values in 𝒴 for the input 𝜉𝑟∗

The size (cardinality) is given by |𝒬𝒚| = |ℱ𝒚| = |𝒴| = 𝑝. It is because all the maps in ℱ𝒚 return the same value 𝑦𝑗 for the input 𝜉𝑗 for each of 𝑗 ∈ [𝑛] \ {𝑟∗} but they return distinct values in 𝒴 for the input 𝜉𝑟∗

work page

[68] [68]

For any 𝑞,𝑞′ ∈ 𝒬𝒚, the maps ℎ𝑞 and ℎ𝑞′ are identical, i.e., 𝔄( (𝒙𝑑∗,𝑓𝑞(𝒙𝑑∗)) ⊤ ) = 𝔄( (𝒙𝑑∗,𝑓𝑞′(𝒙𝑑∗)) ⊤ ) . It is because the equation 𝑓𝑞(𝒙𝑑∗) = 𝑓𝑞′(𝒙𝑑∗) holds since 𝜉𝑟∗ ∉ {𝑥1,𝑥2,…,𝑥𝑚} as 𝜉𝑟∗ = 𝜉𝑟∗ is taken from 𝒳\ {𝑥1,𝑥2,…,𝑥𝑚} by definition and 𝑓𝑞(𝜉) = 𝑓𝑞′(𝜉) is satisfied for all 𝜉 ∈ 𝒳\ {𝑟∗} by the definition of ℱ𝒚. We can uniquely define ℎ𝒚 by ℎ𝒚 ∈ {ℎ𝑞 |...

work page