When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

Francesco Corielli

arxiv: 2605.23278 · v1 · pith:ZDD6TL7Enew · submitted 2026-05-22 · 💻 cs.CL · stat.ML

Pith reviewed 2026-05-25 05:00 UTC · model grok-4.3

keywords: next-token prediction · marginalization · ergodicity · conditional sufficiency · RAG · tool use · mixture identifiability · language models

The pith

Next-token prediction estimates the marginal text-only law and is useful only when observed prefixes are approximately sufficient statistics for latent circumstances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes the full conditional language process (conditioned on latent facts, intentions, and context), the marginal text-only process obtained by integrating those circumstances out, and the distribution learned from finite observed sequences. Interpreting training as estimating the marginal requires assumptions of stationarity, representativeness, and ergodicity that are standard in statistics but difficult to justify for heterogeneous language data. Usefulness of the resulting model for next-token prediction further requires that the residual conditional mutual information between the next token and the omitted circumstances, given the text prefix, be small. The argument extends to heterogeneous corpora and treats RAG and tool use as mechanisms that increase conditional sufficiency.

Core claim

A model trained on realized token trajectories receives sampled continuations and therefore estimates the marginal text-only process rather than the full conditional law; this marginal is useful for prediction only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation, which holds when residual conditional mutual information is small.
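
As a notational sketch, the three objects and the usefulness condition might be written as below. The symbols are assumptions of this sketch rather than confirmed notation from the paper: $X_{\le t}$ for the observed prefix, $X_{t+1}$ for the next token, $Z_t$ for the latent circumstances, and $q_\theta$ for the learned model.

```latex
\begin{align*}
\text{full conditional:} \quad & p(x_{t+1} \mid x_{\le t}, z_t) \\
\text{marginal text-only:} \quad & p(x_{t+1} \mid x_{\le t})
    = \sum_{z_t} p(z_t \mid x_{\le t})\, p(x_{t+1} \mid x_{\le t}, z_t) \\
\text{model-induced:} \quad & q_\theta(x_{t+1} \mid x_{\le t}) \\
\text{local sufficiency:} \quad & I(X_{t+1}; Z_t \mid X_{\le t}) \approx 0
\end{align*}
```

Under this reading, training can at best bring $q_\theta$ close to the second line, and the fourth line is the condition under which the second line is a good stand-in for the first.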

What carries the argument

The three-way distinction among the full conditional language process, the marginal text-only process, and the model-induced distribution, with local sufficiency of the observed prefix serving as the condition for usefulness.

If this is right

RAG improves next-token prediction by supplying additional context that reduces residual mutual information with omitted circumstances.
Tool use functions as a conditional sufficiency device that augments the observed text with external information.
In heterogeneous training corpora the identifiability of mixture components depends on the same sufficiency conditions.
Programming tasks require richer context because code continuations depend on non-textual goals and constraints.
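
The first two implications can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes a hidden binary fact Z, a prefix X that is a weak clue about Z, a retrieved passage R that is a strong clue, and a next token Y that mostly follows Z. All probabilities and helper names are hypothetical. It compares the residual term with and without the retrieved passage in the conditioning set.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def conditional_mi(joint, y, z, cond):
    """I(Y; Z | C) in bits; y and z are axis indices, cond is a tuple of axes."""
    axes = set(range(joint.ndim))

    def H(keep):
        # Entropy of the marginal over the axes in `keep`.
        drop = tuple(sorted(axes - set(keep)))
        return entropy_bits(joint.sum(axis=drop)) if drop else entropy_bits(joint)

    c = tuple(cond)
    return H(c + (y,)) + H(c + (z,)) - H(c + (y, z)) - H(c)

def noisy_copy(accuracy):
    """Table t[observed, z]: the observed symbol equals z with the given accuracy."""
    return np.array([[accuracy, 1 - accuracy], [1 - accuracy, accuracy]])

# Hypothetical toy numbers: Z is a hidden binary fact with a uniform prior.
p_x = noisy_copy(0.70)   # prefix X is a weak clue about Z
p_r = noisy_copy(0.95)   # retrieved passage R is a strong clue about Z
p_y = noisy_copy(0.90)   # next token Y mostly follows Z

# Joint over axes (X, R, Y, Z); X, R, Y are conditionally independent given Z.
joint = 0.5 * np.einsum("xz,rz,yz->xryz", p_x, p_r, p_y)

mi_text_only = conditional_mi(joint, y=2, z=3, cond=(0,))
mi_with_retrieval = conditional_mi(joint, y=2, z=3, cond=(0, 1))
print(f"I(Y;Z|X)   = {mi_text_only:.3f} bits")
print(f"I(Y;Z|X,R) = {mi_with_retrieval:.3f} bits")
```

In this toy setup the residual term shrinks once R joins the prefix, because Y depends on (X, R) only through Z. That conditional-independence structure is an assumption of the example; in general, adding a conditioning variable does not always reduce conditional mutual information.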

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the sufficiency condition fails, scaling data volume alone will not close the gap between marginal and conditional performance.
Tasks with rapidly changing external circumstances may require explicit conditioning mechanisms beyond pure next-token training.
The same marginal-versus-conditional distinction applies to any sequential prediction setting where observations are generated under varying latent regimes.

Load-bearing premise

Real language corpora can be meaningfully analyzed as samples from a stationary ergodic process whose marginal can be estimated from finite observed trajectories.
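
A small simulation can show why the premise matters. The sketch below is an illustration, not the paper's construction: it assumes a hypothetical source that picks one of two regimes once per trajectory and then emits independent binary tokens. Such a source is stationary but not ergodic, so the time average along a single trajectory does not recover the ensemble marginal.

```python
import numpy as np

rng = np.random.default_rng(0)
regime_probs = np.array([0.2, 0.8])   # hypothetical P(token = 1) in each regime

def sample_trajectory(length):
    """Draw a regime once, then emit `length` i.i.d. binary tokens from it."""
    z = rng.integers(0, 2)
    return z, rng.random(length) < regime_probs[z]

# One very long trajectory: the time average settles on the regime's own rate.
z, trajectory = sample_trajectory(100_000)
time_average = trajectory.mean()

# Many short trajectories: the average across them approaches the marginal rate 0.5.
ensemble_average = np.mean([sample_trajectory(50)[1].mean() for _ in range(4000)])

print(f"regime {z}: time average {time_average:.3f}, ensemble average {ensemble_average:.3f}")
```

With equal regime weights the ensemble marginal is 0.5, yet no single trajectory, however long, approaches it. Recovering the marginal here requires many independent trajectories that sample the regimes representatively.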

What would settle it

A direct measurement showing that next-token prediction error remains high even after conditioning on prefixes that are information-theoretically sufficient for the relevant latent circumstances would falsify the usefulness criterion.

Figures

Figures reproduced from arXiv: 2605.23278 by Francesco Corielli.

**Figure 2.** Programming as a favorable regime: specifications, previous code, tests, and errors.

Original abstract

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper cleanly separates the full conditional process, the marginal text law, and the learned model, then ties usefulness to low residual mutual information and frames RAG/tools as sufficiency fixes; the argument is standard stats applied to LMs but lacks any derivation or test.

The core point is that training on token sequences estimates a marginal over text rather than the true conditional law given latent circumstances, and this marginal is only useful when the observed prefix carries most of the relevant information about those circumstances. The paper spells out the stationarity, ergodicity, and representativeness assumptions needed for that marginal to be well-defined from finite data, notes that language corpora violate them, and says the residual conditional mutual information must be small for next-token prediction to work in practice. It then treats RAG and tool use as ways to shrink that residual term by supplying the missing context.

That framing is clear and follows directly from the chain rule and sufficiency definitions. What is actually new is the explicit mapping of augmentation methods onto conditional sufficiency; the rest recycles standard information-theoretic distinctions. The paper does this without circularity or invented quantities.

The main limitation is that everything stays at the level of definitions and verbal argument. There are no derivations showing when the residual term is small, no bounds, and no empirical checks on real corpora or models. The ergodicity critique is familiar and the paper does not add much depth to it. Heterogeneous corpora are mentioned but not analyzed in detail.

This is useful for readers who already think about LM training in information-theoretic terms and want a compact way to organize why pure next-token models need external help. It is not a result that changes practice or theory on its own. A serious editor should send it to review so the distinctions can be stress-tested by people who work on conditional generation and retrieval methods.

Referee Report

1 major / 1 minor

Summary. The paper distinguishes the full conditional language process (conditioned on latent circumstances), the marginal text-only process (circumstances integrated out), and the model distribution learned from finite corpora. It argues that next-token prediction training estimates the marginal only under stationarity, representativeness, and ergodicity assumptions (problematic for heterogeneous language data) and is useful only when the observed prefix is approximately sufficient for the relevant latent circumstances, i.e., when residual conditional mutual information I(next token; circumstances | text) is small. The argument is extended to heterogeneous corpora, and RAG/tool use is interpreted as providing conditional sufficiency.

Significance. If the framework holds, it supplies a clean information-theoretic lens for understanding the scope and limits of next-token training, the mismatch between language data and standard statistical assumptions, and the mechanistic role of retrieval and tools. This could usefully inform both theoretical analyses of LM capabilities and practical system design.

major comments (1)

[Abstract] The usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible; without such support the central practical implication remains untested.

minor comments (1)

The extension to heterogeneous training corpora is announced but receives no detailed treatment or examples in the provided text; a short dedicated subsection would clarify how the stationarity/ergodicity issues compound across domains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We address the single major comment below.

Referee: [Abstract] The usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible; without such support the central practical implication remains untested.

Authors: We agree that the abstract and surrounding discussion would be strengthened by an explicit derivation and supporting illustrations. The key condition follows from the chain rule: H(next token | text) = H(next token | text, circumstances) + I(next token; circumstances | text). When the residual mutual information term is small, the marginal next-token law given text approximates the full conditional law. We will insert this derivation into the revised abstract and add a short subsection with illustrative cases (e.g., technical prose versus open-ended dialogue) showing domains where the term is plausibly negligible. These changes will be incorporated in the next version.

Revision: yes
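
The chain-rule identity quoted in the response can be checked numerically. The sketch below is an illustration only: it draws an arbitrary joint distribution over a small hypothetical alphabet (text X, circumstances Z, next token Y), computes each term from its own definition, and confirms that the two sides agree.

```python
import numpy as np

rng = np.random.default_rng(1)
# Random strictly positive joint over axes (X: text, Z: circumstances, Y: next token).
joint = rng.dirichlet(np.ones(3 * 2 * 4)).reshape(3, 2, 4)

p_x = joint.sum(axis=(1, 2))   # p(x)
p_xz = joint.sum(axis=2)       # p(x, z)
p_xy = joint.sum(axis=1)       # p(x, y)

# H(Y | X) = -sum_{x,y} p(x,y) log2 p(y|x)
h_y_given_x = -(p_xy * np.log2(p_xy / p_x[:, None])).sum()

# H(Y | X, Z) = -sum_{x,z,y} p(x,z,y) log2 p(y|x,z)
h_y_given_xz = -(joint * np.log2(joint / p_xz[:, :, None])).sum()

# I(Y; Z | X) = sum p(x,z,y) log2 [ p(x,z,y) p(x) / (p(x,z) p(x,y)) ]
ratio = joint * p_x[:, None, None] / (p_xz[:, :, None] * p_xy[:, None, :])
cmi = (joint * np.log2(ratio)).sum()

print(f"H(Y|X) = {h_y_given_x:.6f}")
print(f"H(Y|X,Z) + I(Y;Z|X) = {h_y_given_xz + cmi:.6f}")
```

Read as expected log-loss, the identity says that a predictor using only the text pays, on average, exactly I(Y; Z | X) bits more than one that also sees the circumstances. That is why a small residual term makes the marginal law a good substitute for the full conditional law.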

Circularity Check

0 steps flagged

No significant circularity

The paper distinguishes the full conditional process, marginal text-only law, and learned model via standard information-theoretic definitions (chain rule, conditional mutual information, sufficiency). It states assumptions of stationarity/ergodicity/representativeness explicitly as requirements for interpreting training as marginal estimation, without deriving any quantity from fitted parameters or self-citations. RAG/tool-use are positioned as mechanisms to reduce residual I(next token; circumstances | text), following directly from the definitions without reduction to inputs. No equations or claims reduce by construction to the paper's own outputs; the argument is self-contained against external statistical concepts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions from statistics and information theory applied to language data; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Language generation is conditioned on latent non-textual circumstances (facts, events, intentions, goals, beliefs, social context).
Invoked in the abstract as the basis for distinguishing the full conditional process from the marginal text-only process.

Domain assumption: Training corpora can be treated under assumptions of stationarity, representativeness, and ergodicity for marginal estimation.
Explicitly discussed in the abstract as standard statistical assumptions that are required but problematic for heterogeneous language data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity... residual conditional mutual information $I(X_{t+1}; Z_t \mid X_{\le t}) \approx 0$"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "mixture conditional $p_{\mathrm{mix}}(x_{t+1} \mid x_{\le t}) = \sum_k p(k \mid x_{\le t})\, p_k(x_{t+1} \mid x_{\le t})$"
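
The mixture conditional in the quoted passage can be sketched in a few lines. The example below is an assumption-laden illustration: it takes the components to be two first-order Markov chains over a binary alphabet, with hypothetical transition tables, a uniform prior over components, and a shared initial-token distribution. None of these choices come from the paper. The posterior weight p(k | prefix) is obtained by Bayes' rule and then used to average the component predictions.

```python
import numpy as np

# Hypothetical components: transitions[k, a, b] = p_k(next = b | current = a).
transitions = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # component 0: tends to repeat the last token
    [[0.1, 0.9], [0.9, 0.1]],   # component 1: tends to alternate
])
prior = np.array([0.5, 0.5])     # p(k), assumed uniform
initial = np.array([0.5, 0.5])   # first-token distribution, shared by both components

def component_posterior(prefix):
    """p(k | prefix) via Bayes' rule, computed in log space for stability."""
    log_lik = np.full(len(prior), np.log(initial[prefix[0]]))
    for a, b in zip(prefix[:-1], prefix[1:]):
        log_lik += np.log(transitions[:, a, b])
    weights = prior * np.exp(log_lik - log_lik.max())
    return weights / weights.sum()

def mixture_next_token(prefix):
    """p_mix(next | prefix) = sum_k p(k | prefix) * p_k(next | last token)."""
    posterior = component_posterior(prefix)
    return posterior @ transitions[:, prefix[-1], :]

for prefix in ([0, 0, 0, 0, 0], [0, 1, 0, 1]):
    print(prefix, component_posterior(prefix).round(4), mixture_next_token(prefix).round(4))
```

When the prefix identifies the component sharply, the mixture prediction is close to that component's own prediction. When the prefix is short or ambiguous, the posterior stays near the prior and the prediction is a blend, which is one way to read the link between mixture identifiability and prefix sufficiency.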

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

[1] Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., & Baraniuk, R. G. (2024). Self-consuming generative models go MAD. International Conference on Learning Representations.
[2] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
[3] Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198.
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[5] Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16.
[6] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. (2022). Improving language models by retrieving from trillions of tokens. Proceedings of the 39th International Conference on Machine Learning.
[7] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
[8] Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
[9] Dong, C., Yuan, Y., Chen, K., Cheng, S., & Wen, C. (2023). How to build an adaptive AI tutor for any course using knowledge graph-enhanced retrieval-augmented generation (KG-RAG). arXiv:2311.17696.
[10] Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., & Neubig, G. (2023). Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning.
[11] Graesser, A. C., Chipman, P., Haynes, B. C., & Olney, A. (2005). AutoTutor: An intelligent tutoring system with mixed-initiative dialogue. IEEE Transactions on Education, 48(4), 612–618.
[12] Graves, A. (2012). Sequence transduction with recurrent neural networks. ICML Workshop on Representation Learning.
[13] He, T., Zhang, J., Zhou, Z., & Glass, J. (2021). Quantifying exposure bias for neural language generation. Transactions of the Association for Computational Linguistics, 9, 971–986.
[14] Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. International Conference on Learning Representations.
[15] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv:2311.05232.
[16] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), Article 248.
[17] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. (2022). Language models (mostly) know what they know. arXiv:2207.05221.
[18] Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, K., Muhlgay, D., Rozen, N., Schwartz, E., Shachaf, G., Shalev-Shwartz, S., Shashua, A., & Tenenholtz, M. (2022). MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and disc...
[19] Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
[20] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
[21] Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. (2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918.
[22] Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
[23] Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. INTERSPEECH.
[24] Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., & Zettlemoyer, L. (2022). Rethinking the role of demonstrations: What makes in-context learning work? Proceedings of EMNLP.
[25] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
[26] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI technical report.
[27] Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 1270–1278.
[28] Schick, T., Dwivedi-Yu, J., Dessi, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems.
[29] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
[30] Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30(1), 50–64.
[31] Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv:2305.17493.
[32] Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759.
[33] VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.
[34] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[35] Xie, S. M., Raghunathan, A., Liang, P., & Ma, T. (2022). An explanation of in-context learning as implicit Bayesian inference. International Conference on Learning Representations.
[36] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations.
[37] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
