When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
Pith reviewed 2026-05-25 05:00 UTC · model grok-4.3
The pith
Next-token prediction estimates the marginal text-only law and is useful only when observed prefixes are approximately sufficient statistics for latent circumstances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A model trained on realized token trajectories receives sampled continuations and therefore estimates the marginal text-only process rather than the full conditional law; this marginal is useful for prediction only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation, which holds when residual conditional mutual information is small.
What carries the argument
The three-way distinction among the full conditional language process, the marginal text-only process, and the model-induced distribution, with local sufficiency of the observed prefix serving as the condition for usefulness.
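The three objects can be written out in symbols. The notation follows the passage quoted further down this page (next token X_{t+1}, prefix X_{≤t}, latent circumstances Z_t); the symbol q_θ for the trained model is a label assumed here for illustration and may not match the paper's own.

```latex
% Full conditional language process (conditioned on latent circumstances)
p(x_{t+1} \mid x_{\le t}, z_t)

% Marginal text-only process (circumstances integrated out)
p(x_{t+1} \mid x_{\le t})
  = \sum_{z_t} p(z_t \mid x_{\le t}) \, p(x_{t+1} \mid x_{\le t}, z_t)

% Model-induced distribution learned from a finite corpus (q_\theta is an assumed label)
q_\theta(x_{t+1} \mid x_{\le t}) \approx p(x_{t+1} \mid x_{\le t})

% Local sufficiency of the prefix: the condition for usefulness
I(X_{t+1}; Z_t \mid X_{\le t}) \approx 0
```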
If this is right
- RAG improves next-token prediction by supplying additional context that reduces residual mutual information with omitted circumstances.
- Tool use functions as a conditional sufficiency device that augments the observed text with external information.
- In heterogeneous training corpora the identifiability of mixture components depends on the same sufficiency conditions.
- Programming tasks require richer context because code continuations depend on non-textual goals and constraints.
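The first two implications can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes a binary latent fact Z, an uninformative prefix T, a next token X that simply states the fact, and a hypothetical retrieved passage R that reports Z correctly 90% of the time. Under those assumptions it compares the residual conditional mutual information with and without R in the context.

```python
from collections import defaultdict
from math import log2

def cond_mutual_info(joint, a, b, c):
    """I(A;B|C) in bits. `joint` maps outcome tuples to probabilities;
    `a` and `b` are coordinate indices, `c` is a tuple of coordinate indices."""
    def marg(idx):
        m = defaultdict(float)
        for outcome, p in joint.items():
            m[tuple(outcome[i] for i in idx)] += p
        return m
    p_abc, p_ac, p_bc, p_c = marg((a, b) + c), marg((a,) + c), marg((b,) + c), marg(c)
    total = 0.0
    for key, p in p_abc.items():
        if p == 0:
            continue
        av, bv, cv = key[0], key[1], key[2:]
        total += p * log2(p * p_c[cv] / (p_ac[(av,) + cv] * p_bc[(bv,) + cv]))
    return total

# Coordinates of each outcome: (x, z, t, r). All numbers are illustrative assumptions.
X, Z, T, R = 0, 1, 2, 3
joint = {}
for z in (0, 1):
    for r in (0, 1):
        p_r_given_z = 0.9 if r == z else 0.1   # hypothetical retriever accuracy
        joint[(z, z, "prefix", r)] = 0.5 * p_r_given_z  # x equals z: the token states the fact

residual_before = cond_mutual_info(joint, X, Z, (T,))    # text-only context
residual_after = cond_mutual_info(joint, X, Z, (T, R))   # context augmented with retrieval
print(f"I(X;Z|T)   = {residual_before:.3f} bits")
print(f"I(X;Z|T,R) = {residual_after:.3f} bits")
```

In this toy setting the residual term drops from one full bit to the binary entropy of the retriever's error rate, so a perfectly accurate retriever would drive it to zero. Real retrieval is of course not a clean noisy channel on a single latent variable.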
Where Pith is reading between the lines
- If the sufficiency condition fails, scaling data volume alone will not close the gap between marginal and conditional performance.
- Tasks with rapidly changing external circumstances may require explicit conditioning mechanisms beyond pure next-token training.
- The same marginal-versus-conditional distinction applies to any sequential prediction setting where observations are generated under varying latent regimes.
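The first of these inferences can be made precise with a standard information-theoretic identity. It is not quoted from the paper, and it uses the notation of the passage quoted further down this page. The expected extra log-loss paid by the exact marginal predictor, relative to a predictor that also sees the circumstances, equals the residual conditional mutual information:

```latex
\mathbb{E}_{x_{\le t},\, z_t}\!\left[
  D_{\mathrm{KL}}\!\big( p(\cdot \mid x_{\le t}, z_t) \,\big\|\, p(\cdot \mid x_{\le t}) \big)
\right]
= H(X_{t+1} \mid X_{\le t}) - H(X_{t+1} \mid X_{\le t}, Z_t)
= I(X_{t+1}; Z_t \mid X_{\le t})
```

The right-hand side is a property of the data-generating process, not of the sample size. More text-only data can move the learned model closer to the marginal, but it cannot reduce this term.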
Load-bearing premise
Real language corpora can be meaningfully analyzed as samples from a stationary ergodic process whose marginal can be estimated from finite observed trajectories.
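A small simulation can show why ergodicity matters for this premise. The sketch below is an illustration under assumed toy parameters, not the paper's construction. It compares an ergodic two-state Markov chain, where a single long trajectory recovers the stationary marginal, with a stationary but non-ergodic source whose latent regime is drawn once and then held fixed, where a single trajectory does not.

```python
import random

def ergodic_chain_freq(n_steps, rng, p01=0.2, p10=0.4):
    """Fraction of time a two-state Markov chain spends in state 1."""
    state, ones = 0, 0
    for _ in range(n_steps):
        if state == 0:
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
        ones += state
    return ones / n_steps

def regime_locked_freq(n_steps, rng, p_one_by_regime=(0.1, 0.9)):
    """Fraction of 1-tokens when a latent regime is chosen once per trajectory."""
    regime = rng.randrange(2)               # latent circumstance, never resampled
    p_one = p_one_by_regime[regime]
    ones = sum(rng.random() < p_one for _ in range(n_steps))
    return ones / n_steps

rng = random.Random(0)
ergodic_freq = ergodic_chain_freq(50_000, rng)
locked_freq = regime_locked_freq(50_000, rng)

stationary_p1 = 0.2 / (0.2 + 0.4)           # stationary P(state = 1) of the chain
ensemble_p1 = 0.5 * 0.1 + 0.5 * 0.9         # marginal P(token = 1) across regimes

print(f"ergodic chain: time average {ergodic_freq:.3f}, stationary value {stationary_p1:.3f}")
print(f"regime-locked: time average {locked_freq:.3f}, ensemble marginal {ensemble_p1:.3f}")
```

In the regime-locked case the time average settles near 0.1 or 0.9 depending on the draw, never near the ensemble value of 0.5. Recovering the marginal there requires many independent trajectories sampled in representative proportions, which is the representativeness assumption the abstract names.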
What would settle it
A direct measurement showing that next-token prediction error remains high even after conditioning on prefixes that are information-theoretically sufficient for the relevant latent circumstances would falsify the usefulness criterion.
Original abstract
Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper distinguishes the full conditional language process (conditioned on latent circumstances), the marginal text-only process (circumstances integrated out), and the model distribution learned from finite corpora. It argues that next-token prediction training estimates the marginal only under stationarity, representativeness, and ergodicity assumptions (problematic for heterogeneous language data) and is useful only when the observed prefix is approximately sufficient for the relevant latent circumstances, i.e., when residual conditional mutual information I(next token; circumstances | text) is small. The argument is extended to heterogeneous corpora, and RAG/tool use is interpreted as providing conditional sufficiency.
Significance. If the framework holds, it supplies a clean information-theoretic lens for understanding the scope and limits of next-token training, the mismatch between language data and standard statistical assumptions, and the mechanistic role of retrieval and tools. This could usefully inform both theoretical analyses of LM capabilities and practical system design.
major comments (1)
- [Abstract] Abstract: the usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible; without such support the central practical implication remains untested.
minor comments (1)
- The extension to heterogeneous training corpora is announced but receives no detailed treatment or examples in the provided text; a short dedicated subsection would clarify how the stationarity/ergodicity issues compound across domains.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on the manuscript. We address the single major comment below.
- Referee: [Abstract] Abstract: the usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible; without such support the central practical implication remains untested.
Authors: We agree that the abstract and surrounding discussion would be strengthened by an explicit derivation and supporting illustrations. The key condition follows from the chain rule: H(next token | text) = H(next token | text, circumstances) + I(next token; circumstances | text). When the residual mutual information term is small, the marginal next-token law given text approximates the full conditional law. We will insert this derivation into the revised abstract and add a short subsection with illustrative cases (e.g., technical prose versus open-ended dialogue) showing domains where the term is plausibly negligible. These changes will be incorporated in the next version.
Revision: yes
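The chain-rule identity in the response can be checked numerically on a small joint distribution. The sketch below uses made-up probabilities (an assumption, not data from the paper) for two hypothetical prefix types: a "technical" prefix, where the next token barely depends on the latent circumstance, and a "dialogue" prefix, where it depends on it strongly.

```python
from collections import defaultdict
from math import log2

# joint[(t, z, x)]: prefix type, latent circumstance, next token. Illustrative numbers only.
joint = {
    ("technical", "z0", "a"): 0.225, ("technical", "z0", "b"): 0.025,
    ("technical", "z1", "a"): 0.225, ("technical", "z1", "b"): 0.025,
    ("dialogue",  "z0", "a"): 0.225, ("dialogue",  "z0", "b"): 0.025,
    ("dialogue",  "z1", "a"): 0.025, ("dialogue",  "z1", "b"): 0.225,
}
T, Z, X = 0, 1, 2

def marginal(joint, idx):
    """Marginal distribution over the coordinates listed in `idx`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def cond_entropy(joint, target, given):
    """H(target | given) in bits."""
    p_joint = marginal(joint, given + (target,))
    p_given = marginal(joint, given)
    return -sum(p * log2(p / p_given[key[:-1]]) for key, p in p_joint.items() if p > 0)

h_text = cond_entropy(joint, X, (T,))      # H(next token | text)
h_full = cond_entropy(joint, X, (T, Z))    # H(next token | text, circumstances)

# Residual term computed directly from its definition, not as a difference of entropies.
p_tx, p_t, p_tz = marginal(joint, (T, X)), marginal(joint, (T,)), marginal(joint, (T, Z))
residual_mi = sum(
    p * log2((p / p_tz[(t, z)]) / (p_tx[(t, x)] / p_t[(t,)]))
    for (t, z, x), p in joint.items() if p > 0
)

print(f"H(X|T) = {h_text:.4f}, H(X|T,Z) = {h_full:.4f}, I(X;Z|T) = {residual_mi:.4f}")
```

With these numbers the whole residual term comes from the "dialogue" prefix; for the "technical" prefix the text-only and full conditional laws coincide, which is the kind of contrast the authors say they will illustrate.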
Circularity Check
No significant circularity
The paper distinguishes the full conditional process, marginal text-only law, and learned model via standard information-theoretic definitions (chain rule, conditional mutual information, sufficiency). It states assumptions of stationarity/ergodicity/representativeness explicitly as requirements for interpreting training as marginal estimation, without deriving any quantity from fitted parameters or self-citations. RAG/tool-use are positioned as mechanisms to reduce residual I(next token; circumstances | text), following directly from the definitions without reduction to inputs. No equations or claims reduce by construction to the paper's own outputs; the argument is self-contained against external statistical concepts.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Language generation is conditioned on latent non-textual circumstances (facts, events, intentions, goals, beliefs, social context).
- domain assumption: Training corpora can be treated under assumptions of stationarity, representativeness, and ergodicity for marginal estimation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity... residual conditional mutual information $I(X_{t+1}; Z_t \mid X_{\le t}) \approx 0$"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "mixture conditional $p_{\mathrm{mix}}(x_{t+1} \mid x_{\le t}) = \sum_k p(k \mid x_{\le t})\, p_k(x_{t+1} \mid x_{\le t})$"
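The mixture conditional in the second passage can be computed directly for a toy corpus. The sketch below assumes two hypothetical components ("code" and "prose") that are each i.i.d. unigram models with equal prior weight, so that p_k(x_{t+1} | x_{≤t}) does not depend on the prefix. That simplification is made here for illustration and is not claimed by the paper.

```python
from math import prod

# Hypothetical unigram components and prior weights (illustrative numbers only).
components = {
    "code":  {"def": 0.50, "the": 0.10, "return": 0.40},
    "prose": {"def": 0.05, "the": 0.90, "return": 0.05},
}
prior = {"code": 0.5, "prose": 0.5}

def posterior_over_components(prefix):
    """p(k | x_<=t): Bayes' rule with each component's likelihood of the prefix."""
    unnorm = {k: prior[k] * prod(components[k][tok] for tok in prefix) for k in components}
    total = sum(unnorm.values())
    return {k: w / total for k, w in unnorm.items()}

def mixture_next_token(prefix):
    """p_mix(x_{t+1} | x_<=t) = sum_k p(k | x_<=t) * p_k(x_{t+1} | x_<=t)."""
    post = posterior_over_components(prefix)
    vocab = components["code"].keys()
    return {tok: sum(post[k] * components[k][tok] for k in components) for tok in vocab}

empty = mixture_next_token([])            # no prefix: posterior equals the prior
after_def = mixture_next_token(["def"])   # "def" is strong evidence for the code component

print("posterior after 'def':", posterior_over_components(["def"]))
print("next-token law, empty prefix:", empty)
print("next-token law after 'def':  ", after_def)
```

When the prefix separates the components, as "def" does here, the posterior concentrates and the mixture prediction approaches that of a single component. When it does not, the prediction stays a blend of the components, which is the sufficiency issue restated at the level of corpus mixtures.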
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.