pith. machine review for the scientific record. sign in

arxiv: 2604.08519 · v1 · submitted 2026-04-09 · 💻 cs.CL · stat.ML

Recognition: unknown

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3

classification 💻 cs.CL stat.ML
keywords data pruningfact memorizationlanguage modelstraining data selectionmodel capacityentity factsWikipedia corpusinformation theory
0
0 comments X

The pith

Loss-based training data pruning lets smaller language models memorize more facts than training on the full dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models memorize facts suboptimally when the total information in training facts exceeds their capacity, especially if fact frequencies follow a skewed power-law distribution. It introduces data selection methods that rely only on training loss to cap the number of distinct facts and flatten their frequency distribution. These methods bring fact accuracy up to the model's capacity limit on controlled high-entropy datasets. When applied to pretraining from scratch on an annotated Wikipedia corpus, the approach allows a 110-million-parameter model to store 1.3 times more entity facts than standard training and to match the factual recall of a model ten times larger trained on the unpruned data.

Core claim

Fact accuracy remains below capacity whenever the information contained in training facts exceeds model capacity, and this gap widens under skewed frequency distributions. Simple loss-based selection that limits the total number of facts while equalizing their occurrence frequencies raises accuracy to the capacity limit. On a real Wikipedia corpus this selection enables a GPT2-Small model to memorize 1.3X more entity facts than baseline training on the full dataset, matching the fact performance of a 1.3B-parameter model trained without pruning.

What carries the argument

Loss-based data selection that limits the number of facts in the training set and flattens their frequency distribution

If this is right

  • Fact accuracy can be driven to the model's theoretical capacity limit when training data entropy is high.
  • A 110M-parameter model can store 1.3 times more entity facts than it achieves under standard full-dataset training.
  • The pruned training matches the factual recall of a model ten times larger trained on the complete corpus.
  • The same selection mitigates the penalty caused by power-law skew in natural fact distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could reduce the compute needed to reach a given level of factual reliability by training smaller models on curated subsets.
  • Much of the data in typical pretraining corpora may be redundant for the purpose of fact storage.
  • Similar loss-driven pruning might be tested during continued pretraining or on non-text modalities that contain discrete facts.

Load-bearing premise

Loss-based selection can identify and remove only excess or redundant facts without discarding information the model needs for its overall capability or generalization.

What would settle it

Train two models from scratch on the same Wikipedia corpus, one on the loss-pruned subset and one on the full set, then compare their accuracy on a large held-out set of entity facts; if the pruned version does not exceed the full version or fails to reach the capacity limit seen on semi-synthetic data, the central claim is false.

read the original abstract

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that fact memorization in LLMs is suboptimal when training data contains more information than model capacity or has skewed (power-law) fact frequencies. It proposes loss-based data selection to prune excess facts and flatten frequencies, achieving capacity-limit accuracy on semi-synthetic high-entropy fact datasets and, on annotated Wikipedia pretraining, enabling a 110M-parameter GPT-2 Small model to memorize 1.3× more entity facts than standard training—matching a 1.3B model trained on the full corpus.

Significance. If the central empirical result holds after addressing coverage and control issues, the work offers a practical, low-cost way to improve parameter efficiency for factual knowledge without scaling model size, with potential implications for pretraining data curation in knowledge-intensive domains. The information-theoretic framing of capacity limits and frequency skew is a useful lens, though the contribution is primarily empirical rather than theoretical.

major comments (3)
  1. [§4.2] §4.2 (Wikipedia pretraining experiments): the 1.3× memorization gain and parity with the 1.3B model are reported without any post-pruning measurement of unique entity-fact coverage or retention rate. Because selection uses only per-example training loss, it is possible for singleton mentions of rare facts to be pruned, making the effective fact inventory smaller than the baseline; without this coverage statistic the comparison is not load-bearing for the claim that pruning improves capacity utilization rather than simply reducing the target set.
  2. [§3] §3 (data selection method): the loss-based pruning procedure lacks reported ablations on the exact loss threshold, number of epochs used to compute the loss, or sensitivity to random seeds. The abstract and results mention “exact selection thresholds” only in passing; without these controls it is unclear whether the reported gains are robust or depend on particular hyper-parameter choices that could confound the frequency-flattening interpretation.
  3. [§4.1] §4.1 (semi-synthetic experiments): while capacity-limit accuracy is claimed, the paper does not report statistical significance tests or variance across multiple random seeds for the fact-accuracy metric, nor does it ablate whether the improvement persists when the same number of tokens is retained but facts are not explicitly deduplicated. This leaves open the possibility that gains arise from reduced redundancy rather than the proposed information-theoretic mechanism.
minor comments (2)
  1. [§2] Notation for “fact accuracy” and “capacity limit” is introduced in the abstract and §2 but never given an explicit equation; a short formal definition would improve reproducibility.
  2. [Figures 3-5] Figure captions and axis labels in the experimental plots should explicitly state the number of unique facts and total tokens in each condition to allow direct comparison of effective data volume.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each of the major comments below and have updated the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Wikipedia pretraining experiments): the 1.3× memorization gain and parity with the 1.3B model are reported without any post-pruning measurement of unique entity-fact coverage or retention rate. Because selection uses only per-example training loss, it is possible for singleton mentions of rare facts to be pruned, making the effective fact inventory smaller than the baseline; without this coverage statistic the comparison is not load-bearing for the claim that pruning improves capacity utilization rather than simply reducing the target set.

    Authors: We agree that post-pruning coverage statistics are necessary to rule out the possibility that gains stem from a reduced fact inventory. In the revised manuscript, we have added a new analysis in §4.2 that measures unique entity-fact retention rates and coverage after pruning. This shows that our loss-based selection primarily removes redundant mentions of frequent facts while preserving the large majority of unique facts, supporting the interpretation that the observed improvements reflect better capacity utilization. revision: yes

  2. Referee: [§3] §3 (data selection method): the loss-based pruning procedure lacks reported ablations on the exact loss threshold, number of epochs used to compute the loss, or sensitivity to random seeds. The abstract and results mention “exact selection thresholds” only in passing; without these controls it is unclear whether the reported gains are robust or depend on particular hyper-parameter choices that could confound the frequency-flattening interpretation.

    Authors: We thank the referee for noting the need for robustness checks. The revised manuscript includes new ablations (Appendix C) that vary the loss threshold, the number of epochs used to compute per-example loss, and results across multiple random seeds. These experiments confirm that the memorization gains and frequency-flattening effect remain consistent across reasonable choices of these hyperparameters, and we have clarified the exact thresholds employed in the main experiments. revision: yes

  3. Referee: [§4.1] §4.1 (semi-synthetic experiments): while capacity-limit accuracy is claimed, the paper does not report statistical significance tests or variance across multiple random seeds for the fact-accuracy metric, nor does it ablate whether the improvement persists when the same number of tokens is retained but facts are not explicitly deduplicated. This leaves open the possibility that gains arise from reduced redundancy rather than the proposed information-theoretic mechanism.

    Authors: We acknowledge that additional statistical controls and ablations would strengthen the semi-synthetic results. The revised version adds error bars showing variance across multiple random seeds for the fact-accuracy metric, along with statistical significance tests. We also include a control ablation that retains the same number of tokens without our explicit deduplication step; the results indicate that our pruning method continues to outperform this baseline, consistent with the information-theoretic mechanism rather than redundancy reduction alone. revision: yes

Circularity Check

0 steps flagged

No circularity: results are empirical measurements of post-pruning fact accuracy

full rationale

The paper's core claims rest on experimental comparisons: loss-based selection on semi-synthetic high-entropy facts and on an annotated Wikipedia corpus, measuring how many entity facts a GPT-2 Small model memorizes versus a baseline and versus a larger model. The information-theoretic framing (fact accuracy suboptimal when data entropy exceeds capacity, worsened by power-law skew) supplies motivation and interpretation but does not contain equations that define the observed 1.3X gain or the capacity-limit achievement as a direct algebraic consequence of the selection rule itself. No fitted parameter is renamed as a prediction, no self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled in. The skeptic concern about possible loss of unique facts is a validity question about the heuristic, not a reduction of the reported numbers to the paper's own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on an information-theoretic model of fact memorization that treats model capacity as a hard limit and assumes skewed frequency distributions degrade accuracy; no new entities are postulated and no parameters are fitted to produce the reported gains.

axioms (1)
  • domain assumption Fact accuracy becomes suboptimal whenever the total information in training facts exceeds model capacity, and this effect is worsened by power-law frequency skew.
    Invoked in the information-theoretic formalization to explain why standard training underperforms.

pith-pipeline@v0.9.0 · 5497 in / 1354 out tokens · 77754 ms · 2026-05-10T17:36:20.989525+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

106 extracted references · 50 canonical work pages · 10 internal anchors

  1. [1]

    arXiv preprint arXiv:2309.14316 , year=

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023

  2. [2]

    Physics of language models: Part 3.3, knowledge capacity scaling laws.arXiv preprint arXiv:2404.05405, 2024

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. arXiv preprint arXiv:2404.05405, 2024

  3. [3]

    Information complexity of stochastic convex optimization: Applications to generalization and memorization

    Idan Attias, Gintare Karolina Dziugaite, Mahdi Haghifam, Roi Livni, and Daniel M Roy. Information complexity of stochastic convex optimization: Applications to generalization and memorization. arXiv preprint arXiv:2402.09327, 2024

  4. [4]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856

  5. [5]

    Semantic parsing on freebase from question-answer pairs

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533--1544, 2013

  6. [6]

    Emergent and predictable memorization in large language models

    Stella Biderman, Usvsn Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. Emergent and predictable memorization in large language models. Advances in Neural Information Processing Systems, 36: 0 28072--28090, 2023 a

  7. [7]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR, 2023 b

  8. [8]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020

  9. [9]

    Occam's razor

    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam's razor. Information processing letters, 24 0 (6): 0 377--380, 1987

  10. [10]

    Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing, pages 123--132, 2021

  11. [11]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  12. [12]

    Dark experience for general continual learning: a strong, simple baseline

    Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33: 0 15920--15930, 2020

  13. [13]

    The secret sharer: Evaluating and testing unintended memorization in neural networks

    Nicholas Carlini, Chang Liu, \'U lfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX security symposium (USENIX security 19), pages 267--284, 2019

  14. [14]

    Quantifying memorization across neural language models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2022

  15. [15]

    Extracting training data from diffusion models

    Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX security symposium (USENIX Security 23), pages 5253--5270, 2023

  16. [16]

    Reading Wikipedia to Answer Open-Domain Questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017

  17. [17]

    Cheng, W

    Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026

  18. [18]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

  19. [19]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018

  20. [20]

    Autoregressive entity retrieval.arXiv preprint arXiv:2010.00904, 2020

    Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. arXiv preprint arXiv:2010.00904, 2020

  21. [21]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv--2407, 2024

  22. [22]

    Dsdm: Model-aware dataset selection with datamodels

    Logan Engstrom, Axel Feldmann, and Aleksander Madry. Dsdm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926, 2024

  23. [23]

    Doge: Domain reweighting with generalization estimation.arXiv preprint arXiv:2310.15393, 2023

    Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. arXiv preprint arXiv:2310.15393, 2023

  24. [24]

    Does learning require memorization? a short tale about a long tail

    Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 954--959, 2020

  25. [25]

    What neural networks memorize and why: Discovering the long tail via influence estimation

    Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33: 0 2881--2891, 2020

  26. [26]

    Trade-offs in data memorization via strong data processing inequalities

    Vitaly Feldman, Guy Kornowski, and Xin Lyu. Trade-offs in data memorization via strong data processing inequalities. arXiv preprint arXiv:2506.01855, 2025

  27. [27]

    arXiv preprint arXiv:2004.07202 , year=

    Thibault F \'e vry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. Entities as experts: Sparse memory access with entity supervision. arXiv preprint arXiv:2004.07202, 2020

  28. [28]

    Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing llms to do and reveal (almost) anything. arXiv preprint arXiv:2402.14020, 2024

  29. [29]

    Task-adaptive pretrained language models via clustered-importance sampling

    David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. arXiv preprint arXiv:2410.03735, 2024

  30. [30]

    Olmo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15789--15809, 2024

  31. [31]

    Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

    Xinran Gu, Kaifeng Lyu, Jiazheng Li, and Jingzhao Zhang. Data mixing can induce phase transitions in knowledge acquisition. arXiv preprint arXiv:2505.18091, 2025

  32. [32]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  33. [33]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  34. [34]

    Few-shot learning with retrieval augmented language models,

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 1 0 (2): 0 4, 2022

  35. [35]

    Mixture of parrots: Experts improve memorization more than reasoning

    Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M Kakade, and Eran Malach. Mixture of parrots: Experts improve memorization more than reasoning. arXiv preprint arXiv:2410.19034, 2024

  36. [36]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  37. [37]

    How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 0 423--438, 2020

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 0 423--438, 2020

  38. [38]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  39. [39]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  40. [40]

    Prismatic synthesis: Gradient-based data diversification boosts generalization in llm reasoning

    Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Prismatic synthesis: Gradient-based data diversification boosts generalization in llm reasoning. arXiv preprint arXiv:2505.20161, 2025

  41. [41]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  42. [42]

    Deduplicating training data mitigates privacy risks in language models

    Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697--10707. PMLR, 2022

  43. [43]

    Large language models struggle to learn long-tail knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International conference on machine learning, pages 15696--15707. PMLR, 2023

  44. [44]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769--6781, 2020

  45. [45]

    Generalization through memo- rization: Nearest neighbor language models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019

  46. [46]

    Three approaches to the quantitative definition ofinformation’

    Andrei N Kolmogorov. Three approaches to the quantitative definition ofinformation’. Problems of information transmission, 1 0 (1): 0 1--7, 1965

  47. [47]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7: 0 453--466, 2019

  48. [48]

    Deduplicating training data makes language models better

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424--8445, 2022

  49. [49]

    Tic-lm: A web-scale benchmark for time-continual llm pretraining

    Jeffrey Li, Mohammadreza Armandpour, Seyed Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, et al. Tic-lm: A web-scale benchmark for time-continual llm pretraining. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

  50. [50]

    Halueval: A large-scale hallucination evaluation benchmark for large language models.arXiv preprint arXiv:2305.11747, 2023

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. arXiv preprint arXiv:2305.11747, 2023

  51. [51]

    From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning

    Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tec...

  52. [52]

    Not all tokens are what you need for pretraining

    Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen, et al. Not all tokens are what you need for pretraining. Advances in Neural Information Processing Systems, 37: 0 29029--29063, 2024

  53. [53]

    On a measure of the information provided by an experiment

    Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27 0 (4): 0 986--1005, 1956

  54. [54]

    What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

    Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023

  55. [55]

    Analyzing leakage of personally identifiable information in language models

    Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-B \'e guelin. Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 346--363. IEEE, 2023

  56. [56]

    Rephrasing the web: A recipe for compute and data-efficient language modeling

    Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044--14072, 2024

  57. [57]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802--9822, 2023

  58. [58]

    Locating and editing factual associations in gpt

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35: 0 17359--17372, 2022

  59. [59]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

  60. [60]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076--12100, 2023 a

  61. [61]

    Nonparametric masked language modeling

    Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, and Luke Zettlemoyer. Nonparametric masked language modeling. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2097--2118, 2023 b

  62. [62]

    o ren Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt H \

    S \"o ren Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt H \"o ltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630--15649. PMLR, 2022

  63. [63]

    arXiv preprint arXiv:2505.24832 , year =

    John X Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G Edward Suh, Alexander M Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? arXiv preprint arXiv:2505.24832, 2025

  64. [64]

    arxiv-papers

    NICK007X. arxiv-papers. huggingface.co/datasets/nick007x/arxiv-papers, 2025. URL huggingface.co/datasets/nick007x/arxiv-papers. Accessed: Dec 9 2025

  65. [65]

    Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics

    Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics. arXiv preprint arXiv:2104.13346, 2021

  66. [66]

    Understanding llm behaviors via compression: Data generation, knowledge acquisition and scaling laws

    Zhixuan Pan, Shaowen Wang, and Jian Li. Understanding llm behaviors via compression: Data generation, knowledge acquisition and scaling laws. arXiv preprint arXiv:2504.09597, 2025

  67. [67]

    Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

    Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022

  68. [68]

    Fabio Petroni, Tim Rockt \"a schel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2463--2473, 2019

  69. [69]

    Pretraining with hierarchical memories: separating long-tail and common knowledge

    Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, and Oncel Tuzel. Pretraining with hierarchical memories: separating long-tail and common knowledge. arXiv preprint arXiv:2510.02375, 2025

  70. [70]

    How does generative retrieval scale to millions of passages? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1305--1321, 2023

    Ronak Pradeep, Kai Hui, Jai Gupta, Adam Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Tran. How does generative retrieval scale to millions of passages? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1305--1321, 2023

  71. [71]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  72. [72]

    Shaping capabilities with token-level data filtering

    Neil Rathi and Alec Radford. Shaping capabilities with token-level data filtering. arXiv preprint arXiv:2601.21571, 2026

  73. [73]

    How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020

  74. [74]

    How to train data-efficient llms.arXiv preprint arXiv:2402.09668, 2024

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms. arXiv preprint arXiv:2402.09668, 2024

  75. [75]

    Upweighting easy samples in fine-tuning mitigates forgetting

    Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, and Sujay Sanghavi. Upweighting easy samples in fine-tuning mitigates forgetting. arXiv preprint arXiv:2502.02797, 2025

  76. [76]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36: 0 68539--68551, 2023

  77. [77]

    Rethinking llm memorization through the lens of adversarial compression

    Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Lipton, and J Zico Kolter. Rethinking llm memorization through the lens of adversarial compression. Advances in Neural Information Processing Systems, 37: 0 56244--56267, 2024

  78. [78]

    Beyond the reported cutoff: Where large language models fall short on financial knowledge

    Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, and Sudheer Chava. Beyond the reported cutoff: Where large language models fall short on financial knowledge. arXiv preprint arXiv:2504.00042, 2025

  79. [79]

    Gptzero finds 100 new hallucinations in neurips 2025 accepted papers, January 2026

    Nazar Shmatko, Alex Adam, and Paul Esau. Gptzero finds 100 new hallucinations in neurips 2025 accepted papers, January 2026. URL https://gptzero.me/news/neurips/. Accessed: 2026-01-26

  80. [80]

    Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2459--2475, 2025

Showing first 80 references.