arxiv: 2605.23278 · v1 · pith:ZDD6TL7Enew · submitted 2026-05-22 · 💻 cs.CL · stat.ML

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

Pith reviewed 2026-05-25 05:00 UTC · model grok-4.3

classification 💻 cs.CL stat.ML
keywords: next-token prediction · marginalization · ergodicity · conditional sufficiency · RAG · tool use · mixture identifiability · language models

The pith

Next-token prediction estimates the marginal text-only law and is useful only when observed prefixes are approximately sufficient statistics for latent circumstances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes the full conditional language process (conditioned on latent facts, intentions, and context), the marginal text-only process obtained by integrating those circumstances out, and the distribution learned from finite observed sequences. Interpreting training as estimating the marginal requires assumptions of stationarity, representativeness, and ergodicity that are standard in statistics but difficult to justify for heterogeneous language data. Usefulness of the resulting model for next-token prediction further requires that the residual conditional mutual information between the next token and the omitted circumstances, given the text prefix, be small. The argument extends to heterogeneous corpora and treats RAG and tool use as mechanisms that increase conditional sufficiency.

Core claim

A model trained on realized token trajectories receives sampled continuations and therefore estimates the marginal text-only process rather than the full conditional law; this marginal is useful for prediction only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation, which holds when residual conditional mutual information is small.
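
The three objects in this claim can be written out symbolically. The notation below is an assumption introduced here for illustration, not taken from the paper: x_{<t} is the observed prefix, x_t the next token, c the latent circumstances, and q_θ the trained model.

```latex
\begin{align*}
\text{full conditional law:}\quad
  & p(x_t \mid x_{<t}, c) \\
\text{marginal text-only law:}\quad
  & p(x_t \mid x_{<t}) = \sum_{c} p(c \mid x_{<t})\, p(x_t \mid x_{<t}, c) \\
\text{model-induced distribution:}\quad
  & q_\theta(x_t \mid x_{<t}) \\
\text{approximate local sufficiency:}\quad
  & I(X_t ; C \mid X_{<t}) \approx 0
\end{align*}
```

Under this reading, training can at best push q_θ toward the second line, and the fourth line is the condition under which the second line is a good stand-in for the first.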

What carries the argument

The three-way distinction among the full conditional language process, the marginal text-only process, and the model-induced distribution, with local sufficiency of the observed prefix serving as the condition for usefulness.

If this is right

  • RAG improves next-token prediction by supplying additional context that reduces residual mutual information with omitted circumstances.
  • Tool use functions as a conditional sufficiency device that augments the observed text with external information.
  • In heterogeneous training corpora the identifiability of mixture components depends on the same sufficiency conditions.
  • Programming tasks require richer context because code continuations depend on non-textual goals and constraints.
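
A toy calculation can make the "conditional sufficiency device" reading concrete. The sketch below is not from the paper: the variables, the function name `residual_information`, and all probabilities are hypothetical. It assumes a latent fact C (a fair bit) that the bare prefix says nothing about, a retrieved passage R that matches C with a chosen accuracy, and a next token Y that matches C with probability 0.9, with R and Y conditionally independent given C. In general, conditioning on an extra variable does not always lower conditional mutual information; it does in this setup because R is informative about C.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def residual_information(retrieval_accuracy, answer_accuracy=0.9):
    """I(Y; C | R) in bits for the toy model described above (all values hypothetical)."""
    joint = np.zeros((2, 2, 2))  # indexed as joint[r, c, y]
    for r in range(2):
        for c in range(2):
            for y in range(2):
                p_r = retrieval_accuracy if r == c else 1 - retrieval_accuracy
                p_y = answer_accuracy if y == c else 1 - answer_accuracy
                joint[r, c, y] = 0.5 * p_r * p_y
    # H(Y | R) = H(R, Y) - H(R)
    h_y_given_r = entropy_bits(joint.sum(axis=1)) - entropy_bits(joint.sum(axis=(1, 2)))
    # H(Y | R, C) = H(R, C, Y) - H(R, C)
    h_y_given_rc = entropy_bits(joint) - entropy_bits(joint.sum(axis=2))
    return h_y_given_r - h_y_given_rc

for acc in (0.5, 0.75, 0.9, 1.0):
    print(f"retrieval accuracy {acc:.2f}: residual information "
          f"{residual_information(acc):.3f} bits")
```

With an uninformative retriever (accuracy 0.5) the residual term is about 0.531 bits, and it falls to zero as retrieval becomes exact, which is the sense in which the augmented prefix becomes sufficient in this toy setting.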

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the sufficiency condition fails, scaling data volume alone will not close the gap between marginal and conditional performance.
  • Tasks with rapidly changing external circumstances may require explicit conditioning mechanisms beyond pure next-token training.
  • The same marginal-versus-conditional distinction applies to any sequential prediction setting where observations are generated under varying latent regimes.
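
The first extension above can be illustrated with a small simulation. This is a sketch under strong assumptions, not an experiment from the paper: the prefix is taken to be completely uninformative, the hidden circumstance C is a fair bit, and the next token Y copies C with a hypothetical probability of 0.9. The function names and constants are invented for the example. A text-only predictor is fitted on increasingly large samples and its held-out log-loss is compared with the floor available to a predictor that also observes C.

```python
import numpy as np

ANSWER_ACCURACY = 0.9  # hypothetical P(Y = C)

def sample_tokens(n, rng):
    """Draw n next tokens Y, each a noisy copy of a hidden fair bit C."""
    c = rng.integers(0, 2, size=n)
    flip = rng.random(n) > ANSWER_ACCURACY
    return np.where(flip, 1 - c, c)

def text_only_log_loss(n_train, rng, n_test=50_000):
    """Held-out log-loss (bits per token) of a text-only predictor fitted on n_train samples."""
    y_train = sample_tokens(n_train, rng)
    q1 = (y_train.sum() + 1) / (n_train + 2)  # Laplace-smoothed estimate of P(Y = 1)
    y_test = sample_tokens(n_test, rng)
    return float(-np.log2(np.where(y_test == 1, q1, 1 - q1)).mean())

rng = np.random.default_rng(0)
# H(Y | C): the loss floor for a predictor that also sees the circumstances.
floor = float(-(ANSWER_ACCURACY * np.log2(ANSWER_ACCURACY)
                + (1 - ANSWER_ACCURACY) * np.log2(1 - ANSWER_ACCURACY)))
for n in (100, 10_000, 1_000_000):
    print(f"n_train={n:>9}: text-only loss {text_only_log_loss(n, rng):.3f} bits, "
          f"floor {floor:.3f} bits")
```

In this setup the text-only loss stays near 1 bit at every sample size while the floor is about 0.469 bits, so the gap reflects missing conditioning information rather than too little data.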

Load-bearing premise

Real language corpora can be meaningfully analyzed as samples from a stationary ergodic process whose marginal can be estimated from finite observed trajectories.
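
To see why ergodicity matters for this premise, consider a minimal simulation. It is an illustrative assumption, not the paper's construction: a binary token source with two latent regimes, where P(token = 1) is 0.1 in one regime and 0.9 in the other. If the regime is redrawn at every token, the time average along one long trajectory recovers the ensemble mean. If the regime is drawn once and frozen for the whole trajectory, the process is still stationary but no longer ergodic, and a single trajectory reveals only its own regime.

```python
import numpy as np

def time_average(n_tokens, regime_probs, resample_regime, rng):
    """Fraction of 1-tokens along one trajectory of a two-regime Bernoulli source."""
    if resample_regime:
        regimes = rng.integers(0, 2, size=n_tokens)      # fresh regime at every token
    else:
        regimes = np.full(n_tokens, rng.integers(0, 2))  # one regime for the whole trajectory
    p = np.asarray(regime_probs)[regimes]
    return float((rng.random(n_tokens) < p).mean())

rng = np.random.default_rng(0)
regime_probs = (0.1, 0.9)  # hypothetical P(token = 1 | regime)
ensemble_mean = 0.5 * regime_probs[0] + 0.5 * regime_probs[1]

ergodic = [time_average(100_000, regime_probs, True, rng) for _ in range(5)]
frozen = [time_average(100_000, regime_probs, False, rng) for _ in range(5)]

print("ensemble mean:        ", ensemble_mean)
print("regime resampled:     ", np.round(ergodic, 3))
print("regime frozen per run:", np.round(frozen, 3))
```

The resampled trajectories all land near the ensemble mean of 0.5, while each frozen trajectory lands near 0.1 or 0.9. A corpus made of documents that each carry their own fixed circumstances resembles the second case more than the first.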

What would settle it

A direct measurement showing that next-token prediction error remains high even after conditioning on prefixes that are information-theoretically sufficient for the relevant latent circumstances would falsify the usefulness criterion.

Figures

Figures reproduced from arXiv: 2605.23278 by Francesco Corielli.

Figure 1: Four distinct distributions. The model samples from the last object. Identifying it [...]
Figure 2: Programming as a favorable regime: specifications, previous code, tests, and errors [...]
Original abstract

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper distinguishes the full conditional language process (conditioned on latent circumstances), the marginal text-only process (circumstances integrated out), and the model distribution learned from finite corpora. It argues that next-token prediction training estimates the marginal only under stationarity, representativeness, and ergodicity assumptions (problematic for heterogeneous language data) and is useful only when the observed prefix is approximately sufficient for the relevant latent circumstances, i.e., when residual conditional mutual information I(next token; circumstances | text) is small. The argument is extended to heterogeneous corpora, and RAG/tool use is interpreted as providing conditional sufficiency.

Significance. If the framework holds, it supplies a clean information-theoretic lens for understanding the scope and limits of next-token training, the mismatch between language data and standard statistical assumptions, and the mechanistic role of retrieval and tools. This could usefully inform both theoretical analyses of LM capabilities and practical system design.

major comments (1)
  1. [Abstract] The usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible. Without such support, the central practical implication remains untested.
minor comments (1)
  1. The extension to heterogeneous training corpora is announced but receives no detailed treatment or examples in the provided text; a short dedicated subsection would clarify how the stationarity/ergodicity issues compound across domains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible. Without such support, the central practical implication remains untested.

    Authors: We agree that the abstract and surrounding discussion would be strengthened by an explicit derivation and supporting illustrations. The key condition follows from the chain rule: H(next token | text) = H(next token | text, circumstances) + I(next token; circumstances | text). The residual term is nonnegative and equals the expected KL divergence between the full conditional law and the marginal next-token law given text, so when it is small the marginal law closely approximates the full conditional law. We will insert this derivation into the revised abstract and add a short subsection with illustrative cases (e.g., technical prose versus open-ended dialogue) showing domains where the term is plausibly negligible. These changes will be incorporated in the next version.

    Revision: yes
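
The chain-rule identity quoted in the response can be checked numerically on small discrete distributions. The sketch below is illustrative only: the two toy joint distributions and all function names are assumptions made for this example, with X standing for the text prefix, C for the circumstances, and Y for the next token. The conditional mutual information is computed from its definition as an expected log-ratio, so the comparison with the entropy difference is a genuine check and not a restatement.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def conditional_entropies(joint):
    """Return (H(Y|X), H(Y|X,C)) in bits for joint[x, c, y] = p(x, c, y)."""
    h_y_given_x = entropy_bits(joint.sum(axis=1)) - entropy_bits(joint.sum(axis=(1, 2)))
    h_y_given_xc = entropy_bits(joint) - entropy_bits(joint.sum(axis=2))
    return h_y_given_x, h_y_given_xc

def conditional_mutual_information(joint):
    """I(Y; C | X) in bits, from the definition E[log p(y,c|x) / (p(y|x) p(c|x))]."""
    p_x = joint.sum(axis=(1, 2), keepdims=True)
    p_xc = joint.sum(axis=2, keepdims=True)
    p_xy = joint.sum(axis=1, keepdims=True)
    ratio = np.divide(joint * p_x, p_xc * p_xy,
                      out=np.ones_like(joint), where=joint > 0)
    return float((joint * np.log2(ratio)).sum())

# Case A (hypothetical): the prefix X is uninformative and Y copies the latent bit C.
case_a = np.zeros((2, 2, 2))
for x in range(2):
    for c in range(2):
        case_a[x, c, c] = 0.25

# Case B (hypothetical): the prefix reveals C exactly (C = X); Y matches C with probability 0.9.
case_b = np.zeros((2, 2, 2))
for x in range(2):
    case_b[x, x, x] = 0.5 * 0.9
    case_b[x, x, 1 - x] = 0.5 * 0.1

for name, joint in (("A", case_a), ("B", case_b)):
    h_x, h_xc = conditional_entropies(joint)
    cmi = conditional_mutual_information(joint)
    print(f"case {name}: H(Y|X)={h_x:.3f}  H(Y|X,C)={h_xc:.3f}  I(Y;C|X)={cmi:.3f}")
```

In case A the residual term is a full bit, so the text-only law is uninformative about Y even though Y is fully determined once C is known. In case B the prefix is sufficient, the residual term vanishes, and the two conditional entropies coincide.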

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper distinguishes the full conditional process, marginal text-only law, and learned model via standard information-theoretic definitions (chain rule, conditional mutual information, sufficiency). It states assumptions of stationarity/ergodicity/representativeness explicitly as requirements for interpreting training as marginal estimation, without deriving any quantity from fitted parameters or self-citations. RAG and tool use are positioned as mechanisms to reduce residual I(next token; circumstances | text), following directly from the definitions without reduction to inputs. No equations or claims reduce by construction to the paper's own outputs; the argument is self-contained against external statistical concepts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions from statistics and information theory applied to language data; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Language generation is conditioned on latent non-textual circumstances (facts, events, intentions, goals, beliefs, social context).
    Invoked in the abstract as the basis for distinguishing the full conditional process from the marginal text-only process.
  • domain assumption: Training corpora can be treated under assumptions of stationarity, representativeness, and ergodicity for marginal estimation.
    Explicitly discussed in the abstract as standard statistical assumptions that are required but problematic for heterogeneous language data.

pith-pipeline@v0.9.0 · 5785 in / 1393 out tokens · 28402 ms · 2026-05-25T05:00:38.263086+00:00 · methodology
