pith. sign in

arxiv: 2605.27564 · v1 · pith:WFRH6CU4new · submitted 2026-05-26 · 💻 cs.CL · cs.AI· cs.LG

The Future of Facts: Tracing the Factual Generation-Verification Gap

Pith reviewed 2026-06-29 18:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords generation-verification gapfactual knowledgelanguage modelscontinual learningmodel updatingverification robustnessmulti-verse state
0
0 comments X

The pith

Language models learn to verify facts reliably before they learn to generate them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper follows factual knowledge through three training stages—acquisition, continual learning, and updating—across multiple open-source model families. It shows verification capabilities appear earlier than generation and survive continued training better. When new facts are introduced, models can end up treating both the old and new versions as correct at once. These patterns hold in natural experiments on larger frontier models. The work isolates the factual version of the generation-verification gap from other kinds of gaps in model behavior.

Core claim

Across four model families at two scales each, verification is learned before generation during acquisition, remains more robust than generation during continual learning, and factual updates can produce a multi-verse state in which models verify both old and new answers as correct simultaneously. The same dynamics appear at scale on frontier models, along with residual verification biases on well-covered facts.

What carries the argument

The factual generation-verification gap, isolated by tracking generation and verification separately across acquisition, continual learning, and updating phases.

If this is right

  • Verification precedes generation in the acquisition of factual knowledge.
  • Verification capabilities degrade less than generation during further training on other tasks.
  • Factual updates can leave models simultaneously accepting contradictory answers as correct.
  • The same ordering and robustness patterns appear in larger frontier models on well-covered facts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines might usefully separate verification-focused stages from generation-focused stages.
  • Update methods could be tested specifically for whether they reduce simultaneous acceptance of old and new facts.
  • Self-improvement loops that rely on verification may inherit biases from this ordering of capabilities.

Load-bearing premise

Generation and verification can be measured independently through the three phases without the chosen tasks or facts creating the observed gap by design.

What would settle it

An experiment that trains models on a new set of facts using different tasks and shows generation consistently preceding verification or that updates eliminate rather than preserve conflicting verifications.

Figures

Figures reproduced from arXiv: 2605.27564 by Anja Surina, Caglar Gulcehre, Tim R. Davidson.

Figure 4.1
Figure 4.1. Figure 4.1: Verification develops before generation during knowledge acquisition. Fine-tuning loss (top) and generation/verification accuracy (bottom) for four open model families at two scales each. Verification capabilities consistently emerge before generation, opening a GV-gap (shaded region) that closes as both capabilities saturate. The location of the gap is not visible in the loss curve. queries (asking the … view at source ↗
Figure 4.2
Figure 4.2. Figure 4.2: Continual learning re-opens or widens factual GV-gaps. Generation and verification accuracy after switching to unrelated factual data at four intervention points (0, 3, 6, 12 epochs after acquisition). Verification is consistently more robust to continual learning than generation, with larger models maintaining higher floors. 4.3 What are the effects of updates? [PITH_FULL_IMAGE:figures/full_fig_p006_4_2.png] view at source ↗
Figure 4.3
Figure 4.3. Figure 4.3: shows the effect of updating a synthetic fact: at the epoch indicated by the vertical line, every paraphrase in the original training data is rewritten to replace the original answer with a new one. After the switch, models successfully shift their generative output to the new answer — generation of the obsolete fact effectively ceases. Verification, however, does not follow suit: models continue to conf… view at source ↗
Figure 4.4
Figure 4.4. Figure 4.4: Natural variation in real-world data exposes GV-gaps in frontier models. Generation and verification capabilities of GPT 5.4 and Gemini 3 Flash across three datasets spanning a coverage gradient (S&P 500, NBA, lottery) from 2002 to 2024. Each dataset traverses three regimes: (1) too little data for either capability to emerge, (2) an opening GV-gap, and (3) convergence, with higher-coverage datasets tran… view at source ↗
Figure 4.5
Figure 4.5. Figure 4.5: Residual multi-verse state in frontier models on naturalistic updating. Accuracy at rejecting incorrect Billboard Hot 100 song-rank pairings for GPT 5.4 (top) and Gemini 3 (bottom), across four time periods. Random-noise queries replace the correct song with a random song from the top 10; ranked-noise queries replace it with the song that held the same rank at week T ± k. larger counterparts miss. For ve… view at source ↗
Figure 4.6
Figure 4.6. Figure 4.6: Distillation widens the GV-gap; reasoning effort affects model families differently. Results for three model sizes: Large, Medium, and Small; (a) Distillation increases the raw GV-gap and decreases verification utility across both families. (b) Distillation widens verification bias for Gemini 3 models. Increased reasoning effort widens bias for GPT 5.4 but reduces it for Gemini 3. 8 [PITH_FULL_IMAGE:fig… view at source ↗
read the original abstract

Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that language models exhibit a generation-verification gap (GV-gap) specific to factual knowledge. Tracing capabilities across acquisition, continual learning, and updating phases in four open-source model families (two scales each), it reports three recurring findings: (i) verification is learned before generation, (ii) verification is more robust to continual learning than generation, and (iii) factual updates can leave models in a 'multi-verse' state simultaneously verifying both old and new answers as correct. Natural experiments on frontier models are said to reproduce the dynamics at scale.

Significance. If the three findings prove robust once operational details are supplied, the work would provide a useful empirical map of how factual generation and verification diverge during training. This could inform self-improvement pipelines and update strategies that aim to keep verification and generation aligned, adding to the literature on factual reliability in LLMs.

major comments (1)
  1. [Abstract] Abstract: The three central findings are stated without any description of how generation versus verification was operationalized, which facts were selected, what prompt formats or controls were used, or any error analysis. This is load-bearing for all claims, because the skeptic concern that the GV-gap may be created by construction through task design or fact selection cannot be evaluated from the given information.
minor comments (1)
  1. [Abstract] Abstract: The distinction drawn between factual GV-gaps and their 'computational and aesthetic counterparts' is introduced without definition or citation, which may reduce clarity for readers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding operational details. We address this point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The three central findings are stated without any description of how generation versus verification was operationalized, which facts were selected, what prompt formats or controls were used, or any error analysis. This is load-bearing for all claims, because the skeptic concern that the GV-gap may be created by construction through task design or fact selection cannot be evaluated from the given information.

    Authors: We agree that the abstract's brevity leaves the operationalization of the GV-gap underspecified, which is necessary to evaluate potential task-design artifacts. The full manuscript details these elements in Section 3 (Methods): generation is measured via exact-match accuracy on open-ended factual completion prompts; verification uses multiple-choice accuracy on the same facts with both correct and distractor options; facts are drawn from a curated set of 500 Wikipedia-derived triples balanced across domains with controls for popularity and recency; prompt formats include zero-shot and few-shot variants with randomization of option order; and error analysis reports per-model confusion matrices plus inter-annotator agreement on a human subset. To make these claims evaluable directly from the abstract and preempt concerns about construction artifacts, we will add a concise clause describing the core measurement approach and fact-selection criteria. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical study

full rationale

The paper reports three recurring empirical observations from training phases (acquisition, continual learning, updating) across model families. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. The findings are direct measurements of generation vs. verification performance; they do not reduce to inputs by construction. The study is self-contained against external benchmarks via reported metrics on open-source models and natural experiments on frontier models. No load-bearing self-citations or ansatz smuggling are present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical observational study; no free parameters, no invented entities, and only standard domain assumptions about training phases and capability measurement.

axioms (1)
  • domain assumption Division of training into acquisition, continual learning, and updating phases is a valid and exhaustive partitioning for tracing capability emergence.
    Invoked to structure the tracing of generation and verification.

pith-pipeline@v0.9.1-grok · 5695 in / 1106 out tokens · 32413 ms · 2026-06-29T18:28:50.030777+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

154 extracted references · 28 canonical work pages · 13 internal anchors

  1. [1]

    Jon Saad-Falcon, E. Kelly Buchanan, Mayee F Chen, Tzu-Heng Huang, Brendan McLaugh- lin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, and Christopher Re. Weaver: Shrinking the generation-verification gap by scaling compute for verification. InThe Thirty-ninth Annual Conference on Neural Information Process...

  2. [2]

    Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

    Yuda Song, Hanlin Zhang, Udaya Ghai, Carson Eisenach, Sham M Kakade, and Dean Foster. Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models. In ICLR, 2025. URLhttps://openreview.net/pdf?id=mtJSMcF3ek

  3. [3]

    Large language models can self-improve

    Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. InEMNLP, pages 1051–1068, 2023

  4. [4]

    Self-improvement in language models: The sharpening mechanism.ICLR, 2025

    Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T Ash, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism.ICLR, 2025

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, November 2021. URL http://arxiv.org/abs/2110.14168. arXiv:2110.14168

  6. [6]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InICLR,

  7. [7]

    URLhttps://openreview.net/forum?id=v8L0pN6EOi

  8. [8]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  9. [9]

    Scaling llm test-time compute optimally can be more effective than scaling model parameters.ICLR, 2025

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.ICLR, 2025

  10. [10]

    Generative verifiers: Reward modeling as next-token prediction.ICLR, 2025

    Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction.ICLR, 2025

  11. [11]

    Andrew Bagnell

    Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, and J. Andrew Bagnell. All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning, March 2025. URLhttp://arxiv.org/abs/2503.01067. arXiv:2503.01067

  12. [12]

    Theoretical modeling of llm self- improvement training dynamics through solver-verifier gap.arXiv preprint arXiv:2507.00075, 2025

    Yifan Sun, Yushan Liang, Zhen Zhang, and Jiaye Teng. Theoretical modeling of llm self- improvement training dynamics through solver-verifier gap.arXiv preprint arXiv:2507.00075, 2025

  13. [13]

    How people use chatgpt

    Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Technical report, National Bureau of Economic Research, 2025

  14. [14]

    Introducing bing generative search

    Microsoft. Introducing bing generative search. https://blogs.bing.com/search/July- 2024/generativesearch, July 2024

  15. [15]

    Generative ai in search: Let google do the searching for you

    Elizabeth Reid. Generative ai in search: Let google do the searching for you. https://blog.google/products-and-platforms/products/search/generative-ai- google-search-may-2024/, May 2024. 10

  16. [16]

    A representative study on human detection of artificially generated media across countries

    Joel Frank, Franziska Herbert, Jonas Ricker, Lea Schönherr, Thorsten Eisenhofer, Asja Fischer, Markus Dürmuth, and Thorsten Holz. A representative study on human detection of artificially generated media across countries. In2024 IEEE Symposium on Security and Privacy (SP), pages 55–73. IEEE, 2024

  17. [17]

    As good as a coin toss: Human detection of ai-generated content.Commun

    Di Cooke, Abigail Edwards, Sophia Barkoff, and Kathryn Kelly. As good as a coin toss: Human detection of ai-generated content.Commun. ACM, 68(10):100–109, September 2025. ISSN 0001-0782. doi: 10.1145/3729417. URLhttps://doi.org/10.1145/3729417

  18. [18]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

  19. [19]

    Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

  20. [20]

    Hal- luhard: A hard multi-turn hallucination benchmark.arXiv preprint arXiv:2602.01031, 2026

    Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, and Maksym Andriushchenko. Hal- luhard: A hard multi-turn hallucination benchmark.arXiv preprint arXiv:2602.01031, 2026

  21. [21]

    Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY , USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10....

  22. [22]

    How ai can distort human beliefs.Science, 380(6651): 1222–1223, 2023

    Celeste Kidd and Abeba Birhane. How ai can distort human beliefs.Science, 380(6651): 1222–1223, 2023

  23. [23]

    Generative language models and automated influence operations: Emerging threats and potential mitigations.arXiv preprint arXiv:2301.04246, 1, 2023

    Josh A Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. Generative language models and automated influence operations: Emerging threats and potential mitigations.arXiv preprint arXiv:2301.04246, 1, 2023

  24. [24]

    Proce- dural knowledge in pretraining drives reasoning in large language models.ICLR, 2025

    Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, and Max Bartolo. Proce- dural knowledge in pretraining drives reasoning in large language models.ICLR, 2025

  25. [25]

    Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G

    John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexan- der M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize?, June 2025. URLhttp://arxiv.org/abs/2505.24832. arXiv:2505.24832

  26. [26]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openr...

  27. [27]

    Physics of language models: Part 3.1, knowledge storage and extraction.ICML, 2024

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction.ICML, 2024

  28. [28]

    Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STars

    Kanishk Gandhi, Ayush K Chakravarthy, Anikait Singh, Nathan Lile, and Noah Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STars. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/ forum?id=QGJ9ttXLTy

  29. [29]

    Octothinker: Mid-training incentivizes reinforcement learning scaling.arXiv preprint arXiv:2506.20512, 2025

    Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. Octothinker: Mid-training incentivizes reinforcement learning scaling.arXiv preprint arXiv:2506.20512, 2025

  30. [30]

    What It Can Create, It May Not Understand

    Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, and Yejin Choi. The Generative AI Paradox: “What It Can Create, It May Not Understand”. InThe Twelfth International Conference on Learning Representations, 2024. U...

  31. [31]

    Self-recognition in language models.EMNLP, 2024

    Tim R Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, and Caglar Gulcehre. Self-recognition in language models.EMNLP, 2024

  32. [32]

    How do large language models acquire factual knowledge during pretraining? Advances in neural information processing systems, 37:60626–60668, 2024

    Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, and Minjoon Seo. How do large language models acquire factual knowledge during pretraining? Advances in neural information processing systems, 37:60626–60668, 2024

  33. [33]

    How do language models learn facts? dynamics, curricula and hallucinations

    Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025

  34. [34]

    Birth of a transformer: A memory viewpoint.Advances in Neural Information Processing Systems, 36: 1560–1588, 2023

    Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint.Advances in Neural Information Processing Systems, 36: 1560–1588, 2023

  35. [35]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InEMNLP, pages 5484–5495, 2021

  36. [36]

    Dissecting recall of factual associations in auto-regressive language models.EMNLP, 2023

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models.EMNLP, 2023

  37. [37]

    Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

  38. [38]

    Lee, and Alberto Bietti

    Eshaan Nichani, Jason D. Lee, and Alberto Bietti. Understanding Factual Recall in Trans- formers via Associative Memories. InICLR, October 2025. URL https://openreview.net/ forum?id=hwSmPOAmhk

  39. [39]

    Editing factual knowledge in language models

    Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. EMNLP, 2021

  40. [40]

    Memory-based model editing at scale

    Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory-based model editing at scale. InInternational Conference on Machine Learning, pages 15817–15831. PMLR, 2022

  41. [41]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  42. [42]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

  43. [43]

    Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  44. [44]

    The nature of recollection and familiarity: A review of 30 years of research.Journal of memory and language, 46(3):441–517, 2002

    Andrew P Yonelinas. The nature of recollection and familiarity: A review of 30 years of research.Journal of memory and language, 46(3):441–517, 2002

  45. [45]

    Recognition and retrieval processes in free recall

    John R Anderson and Gordon H Bower. Recognition and retrieval processes in free recall. Psychological review, 79(2):97, 1972

  46. [46]

    Search processes in recognition memory 1

    Richard C Atkinson, Douglas J Herrmann, and Keith T Wescourt. Search processes in recognition memory 1. InTheories in cognitive psychology, pages 101–146. Routledge, 1974

  47. [47]

    On the relationship between autobiographical memory and perceptual learning.Journal of Experimental Psychology: General, 110(3):306, 1981

    Larry L Jacoby and Mark Dallas. On the relationship between autobiographical memory and perceptual learning.Journal of Experimental Psychology: General, 110(3):306, 1981

  48. [48]

    Functional neuroanatomy of recall and recognition: A pet study of episodic memory.Journal of cognitive neuroscience, 9(2):254–265, 1997

    Roberto Cabeza, Shitij Kapur, Fergus IM Craik, Anthony R McIntosh, Sylvain Houle, and Endel Tulving. Functional neuroanatomy of recall and recognition: A pet study of episodic memory.Journal of cognitive neuroscience, 9(2):254–265, 1997. 12

  49. [49]

    Cambridge University Press, 2014

    Shai Shalev-Shwartz and Shai Ben-David.Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014

  50. [50]

    Bin Liu and Geoffrey I. Webb. Generative and discriminative learning. In Claude Sammut and Geoffrey I. Webb, editors,Encyclopedia of Machine Learning, pages 454–455. Springer US, Boston, MA, 2010. ISBN 978-0-387-30164-8. doi: 10.1007/978-0-387-30164-8 _332. URL https://doi.org/10.1007/978-0-387-30164-8_332

  51. [51]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

  52. [52]

    Stephen A. Cook. The complexity of theorem-proving procedures. InProceedings of the Third Annual ACM Symposium on Theory of Computing, STOC ’71, page 151–158. Association for Computing Machinery, New York, NY , USA, 1971. ISBN 9781450374644. doi: 10.1145/ 800157.805047. URLhttps://doi.org/10.1145/800157.805047

  53. [53]

    Reducibility among combinatorial problems

    Richard M Karp. Reducibility among combinatorial problems. In50 Years of Integer Pro- gramming 1958-2008: from the Early Years to the State-of-the-Art, pages 219–241. Springer, 1972

  54. [54]

    Universal sequential search problems.Problems of information transmission, 9(3):265–266, 1973

    Leonid A Levin. Universal sequential search problems.Problems of information transmission, 9(3):265–266, 1973

  55. [55]

    Barnes & Noble, New York, 1790

    Immanuel Kant.Critique of Judgment. Barnes & Noble, New York, 1790

  56. [56]

    Towards understanding sycophancy in language models.ICLR, 2024

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models.ICLR, 2024

  57. [57]

    a is b" fail to learn

    Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on" a is b" fail to learn" b is a". ICLR, 2024

  58. [58]

    Physics of language models: Part 3.2, knowledge manipu- lation.ICLR, 2025

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipu- lation.ICLR, 2025

  59. [59]

    On the generalization of language models from in-context learning and finetuning: a controlled study

    Andrew K Lampinen, Arslan Chaudhry, Stephanie CY Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, and James L McClelland. On the generalization of language models from in-context learning and finetuning: a controlled study. NeurIPS, FoRLM Workshop, 2025

  60. [60]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  61. [61]

    Claude sonnet 4.5

    Anthropic. Claude sonnet 4.5. https://www.anthropic.com, 2025. Large language model; API identifier: claude-sonnet-4-5

  62. [62]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  63. [63]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  64. [64]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024. 13

  65. [65]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4- mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025

  66. [66]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  67. [67]

    Llama 3.2: Revolutionizing edge AI and vision with open, customizable mod- els, September 2024

    Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable mod- els, September 2024. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision- edge-mobile-devices. Blog post

  68. [68]

    Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models.Journal of Machine Learning Research, 26(53):1–66, 2025. URL http: //jmlr.org/papers/v26/24-1000.html

  69. [69]

    When scaling meets LLM finetun- ing: The effect of data, model and finetuning method

    Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetun- ing: The effect of data, model and finetuning method. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=5HCnKDeTws

  70. [70]

    T-rex: A large scale alignment of natural language with knowledge base triples

    Hady Elsahar, Pavlos V ougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Fred- erique Laforest, and Elena Simperl. T-rex: A large scale alignment of natural language with knowledge base triples. InProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

  71. [71]

    Judging llm-as-a-judge with mt- bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt- bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  72. [72]

    A new era of intelligence with Gemini 3, November 2025

    Google. A new era of intelligence with Gemini 3, November 2025. URL https://blog. google/products-and-platforms/products/gemini/gemini-3/. Blog post

  73. [73]

    Reasoning- driven synthetic data generation and evaluation.TMLR, 2026

    Tim R Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, and Hamza Harkous. Reasoning- driven synthetic data generation and evaluation.TMLR, 2026

  74. [74]

    The digitization of the world from edge to core.Framingham: International Data Corporation, 16:1–28, 2018

    David Reinsel-John Gantz-John Rydning, John Reinsel, and John Gantz. The digitization of the world from edge to core.Framingham: International Data Corporation, 16:1–28, 2018

  75. [75]

    Introducing GPT-5.4, 2026

    OpenAI. Introducing GPT-5.4, 2026. URL https://openai.com/index/introducing-gpt- 5-4/. Blog post

  76. [76]

    Are we in the ai-generated text world already? quantifying and monitoring aigt on social media

    Zhen Sun, Zongmin Zhang, Xinyue Shen, Ziyi Zhang, Yule Liu, Michael Backes, Yang Zhang, and Xinlei He. Are we in the ai-generated text world already? quantifying and monitoring aigt on social media. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22975–23005, 2025

  77. [77]

    The Impact of AI-Generated Text on the Internet

    Jonas Dolezal, Sawood Alam, Mark Graham, and Maty Bohacek. The impact of ai-generated text on the internet, 2026. URLhttps://arxiv.org/abs/2604.26965

  78. [78]

    Large language models reduce public knowledge sharing on online q&a platforms.PNAS nexus, 3(9):pgae400, 2024

    R Maria del Rio-Chanona, Nadzeya Laurentsyeva, and Johannes Wachs. Large language models reduce public knowledge sharing on online q&a platforms.PNAS nexus, 3(9):pgae400, 2024

  79. [79]

    Ai models collapse when trained on recursively generated data.Nature, 631(8022): 755–759, 2024

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data.Nature, 631(8022): 755–759, 2024

  80. [80]

    Physics of language models: Part 3.3, knowledge capacity scaling laws

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=FxNNiUgtfa. 14

Showing first 80 references.