arxiv: 2604.08519 · v1 · submitted 2026-04-09 · 💻 cs.CL · stat.ML

Recognition: unknown

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Jiayuan Ye , Vitaly Feldman , Kunal Talwar

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3

classification 💻 cs.CL stat.ML

keywords data pruningfact memorizationlanguage modelstraining data selectionmodel capacityentity factsWikipedia corpusinformation theory

0 comments

The pith

Loss-based training data pruning lets smaller language models memorize more facts than training on the full dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models memorize facts suboptimally when the total information in training facts exceeds their capacity, especially if fact frequencies follow a skewed power-law distribution. It introduces data selection methods that rely only on training loss to cap the number of distinct facts and flatten their frequency distribution. These methods bring fact accuracy up to the model's capacity limit on controlled high-entropy datasets. When applied to pretraining from scratch on an annotated Wikipedia corpus, the approach allows a 110-million-parameter model to store 1.3 times more entity facts than standard training and to match the factual recall of a model ten times larger trained on the unpruned data.

Core claim

Fact accuracy remains below capacity whenever the information contained in training facts exceeds model capacity, and this gap widens under skewed frequency distributions. Simple loss-based selection that limits the total number of facts while equalizing their occurrence frequencies raises accuracy to the capacity limit. On a real Wikipedia corpus this selection enables a GPT2-Small model to memorize 1.3X more entity facts than baseline training on the full dataset, matching the fact performance of a 1.3B-parameter model trained without pruning.

What carries the argument

Loss-based data selection that limits the number of facts in the training set and flattens their frequency distribution

If this is right

Fact accuracy can be driven to the model's theoretical capacity limit when training data entropy is high.
A 110M-parameter model can store 1.3 times more entity facts than it achieves under standard full-dataset training.
The pruned training matches the factual recall of a model ten times larger trained on the complete corpus.
The same selection mitigates the penalty caused by power-law skew in natural fact distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could reduce the compute needed to reach a given level of factual reliability by training smaller models on curated subsets.
Much of the data in typical pretraining corpora may be redundant for the purpose of fact storage.
Similar loss-driven pruning might be tested during continued pretraining or on non-text modalities that contain discrete facts.

Load-bearing premise

Loss-based selection can identify and remove only excess or redundant facts without discarding information the model needs for its overall capability or generalization.

What would settle it

Train two models from scratch on the same Wikipedia corpus, one on the loss-pruned subset and one on the full set, then compare their accuracy on a large held-out set of entity facts; if the pruned version does not exceed the full version or fails to reach the capacity limit seen on semi-synthetic data, the central claim is false.

read the original abstract

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that fact memorization in LLMs is suboptimal when training data contains more information than model capacity or has skewed (power-law) fact frequencies. It proposes loss-based data selection to prune excess facts and flatten frequencies, achieving capacity-limit accuracy on semi-synthetic high-entropy fact datasets and, on annotated Wikipedia pretraining, enabling a 110M-parameter GPT-2 Small model to memorize 1.3× more entity facts than standard training—matching a 1.3B model trained on the full corpus.

Significance. If the central empirical result holds after addressing coverage and control issues, the work offers a practical, low-cost way to improve parameter efficiency for factual knowledge without scaling model size, with potential implications for pretraining data curation in knowledge-intensive domains. The information-theoretic framing of capacity limits and frequency skew is a useful lens, though the contribution is primarily empirical rather than theoretical.

major comments (3)

[§4.2] §4.2 (Wikipedia pretraining experiments): the 1.3× memorization gain and parity with the 1.3B model are reported without any post-pruning measurement of unique entity-fact coverage or retention rate. Because selection uses only per-example training loss, it is possible for singleton mentions of rare facts to be pruned, making the effective fact inventory smaller than the baseline; without this coverage statistic the comparison is not load-bearing for the claim that pruning improves capacity utilization rather than simply reducing the target set.
[§3] §3 (data selection method): the loss-based pruning procedure lacks reported ablations on the exact loss threshold, number of epochs used to compute the loss, or sensitivity to random seeds. The abstract and results mention “exact selection thresholds” only in passing; without these controls it is unclear whether the reported gains are robust or depend on particular hyper-parameter choices that could confound the frequency-flattening interpretation.
[§4.1] §4.1 (semi-synthetic experiments): while capacity-limit accuracy is claimed, the paper does not report statistical significance tests or variance across multiple random seeds for the fact-accuracy metric, nor does it ablate whether the improvement persists when the same number of tokens is retained but facts are not explicitly deduplicated. This leaves open the possibility that gains arise from reduced redundancy rather than the proposed information-theoretic mechanism.

minor comments (2)

[§2] Notation for “fact accuracy” and “capacity limit” is introduced in the abstract and §2 but never given an explicit equation; a short formal definition would improve reproducibility.
[Figures 3-5] Figure captions and axis labels in the experimental plots should explicitly state the number of unique facts and total tokens in each condition to allow direct comparison of effective data volume.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each of the major comments below and have updated the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses

Referee: [§4.2] §4.2 (Wikipedia pretraining experiments): the 1.3× memorization gain and parity with the 1.3B model are reported without any post-pruning measurement of unique entity-fact coverage or retention rate. Because selection uses only per-example training loss, it is possible for singleton mentions of rare facts to be pruned, making the effective fact inventory smaller than the baseline; without this coverage statistic the comparison is not load-bearing for the claim that pruning improves capacity utilization rather than simply reducing the target set.

Authors: We agree that post-pruning coverage statistics are necessary to rule out the possibility that gains stem from a reduced fact inventory. In the revised manuscript, we have added a new analysis in §4.2 that measures unique entity-fact retention rates and coverage after pruning. This shows that our loss-based selection primarily removes redundant mentions of frequent facts while preserving the large majority of unique facts, supporting the interpretation that the observed improvements reflect better capacity utilization. revision: yes
Referee: [§3] §3 (data selection method): the loss-based pruning procedure lacks reported ablations on the exact loss threshold, number of epochs used to compute the loss, or sensitivity to random seeds. The abstract and results mention “exact selection thresholds” only in passing; without these controls it is unclear whether the reported gains are robust or depend on particular hyper-parameter choices that could confound the frequency-flattening interpretation.

Authors: We thank the referee for noting the need for robustness checks. The revised manuscript includes new ablations (Appendix C) that vary the loss threshold, the number of epochs used to compute per-example loss, and results across multiple random seeds. These experiments confirm that the memorization gains and frequency-flattening effect remain consistent across reasonable choices of these hyperparameters, and we have clarified the exact thresholds employed in the main experiments. revision: yes
Referee: [§4.1] §4.1 (semi-synthetic experiments): while capacity-limit accuracy is claimed, the paper does not report statistical significance tests or variance across multiple random seeds for the fact-accuracy metric, nor does it ablate whether the improvement persists when the same number of tokens is retained but facts are not explicitly deduplicated. This leaves open the possibility that gains arise from reduced redundancy rather than the proposed information-theoretic mechanism.

Authors: We acknowledge that additional statistical controls and ablations would strengthen the semi-synthetic results. The revised version adds error bars showing variance across multiple random seeds for the fact-accuracy metric, along with statistical significance tests. We also include a control ablation that retains the same number of tokens without our explicit deduplication step; the results indicate that our pruning method continues to outperform this baseline, consistent with the information-theoretic mechanism rather than redundancy reduction alone. revision: yes

Circularity Check

0 steps flagged

No circularity: results are empirical measurements of post-pruning fact accuracy

full rationale

The paper's core claims rest on experimental comparisons: loss-based selection on semi-synthetic high-entropy facts and on an annotated Wikipedia corpus, measuring how many entity facts a GPT-2 Small model memorizes versus a baseline and versus a larger model. The information-theoretic framing (fact accuracy suboptimal when data entropy exceeds capacity, worsened by power-law skew) supplies motivation and interpretation but does not contain equations that define the observed 1.3X gain or the capacity-limit achievement as a direct algebraic consequence of the selection rule itself. No fitted parameter is renamed as a prediction, no self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled in. The skeptic concern about possible loss of unique facts is a validity question about the heuristic, not a reduction of the reported numbers to the paper's own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on an information-theoretic model of fact memorization that treats model capacity as a hard limit and assumes skewed frequency distributions degrade accuracy; no new entities are postulated and no parameters are fitted to produce the reported gains.

axioms (1)

domain assumption Fact accuracy becomes suboptimal whenever the total information in training facts exceeds model capacity, and this effect is worsened by power-law frequency skew.
Invoked in the information-theoretic formalization to explain why standard training underperforms.

pith-pipeline@v0.9.0 · 5497 in / 1354 out tokens · 77754 ms · 2026-05-10T17:36:20.989525+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

106 extracted references · 50 canonical work pages · 10 internal anchors

[1]

arXiv preprint arXiv:2309.14316 , year=

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023

work page arXiv 2023
[2]

Physics of language models: Part 3.3, knowledge capacity scaling laws.arXiv preprint arXiv:2404.05405, 2024

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. arXiv preprint arXiv:2404.05405, 2024

work page arXiv 2024
[3]

Information complexity of stochastic convex optimization: Applications to generalization and memorization

Idan Attias, Gintare Karolina Dziugaite, Mahdi Haghifam, Roi Livni, and Daniel M Roy. Information complexity of stochastic convex optimization: Applications to generalization and memorization. arXiv preprint arXiv:2402.09327, 2024

work page arXiv 2024
[4]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856

2023
[5]

Semantic parsing on freebase from question-answer pairs

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533--1544, 2013

2013
[6]

Emergent and predictable memorization in large language models

Stella Biderman, Usvsn Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. Emergent and predictable memorization in large language models. Advances in Neural Information Processing Systems, 36: 0 28072--28090, 2023 a

2023
[7]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR, 2023 b

2023
[8]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020

2020
[9]

Occam's razor

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam's razor. Information processing letters, 24 0 (6): 0 377--380, 1987

1987
[10]

Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing, pages 123--132, 2021

2021
[11]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

1901
[12]

Dark experience for general continual learning: a strong, simple baseline

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33: 0 15920--15930, 2020

2020
[13]

The secret sharer: Evaluating and testing unintended memorization in neural networks

Nicholas Carlini, Chang Liu, \'U lfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX security symposium (USENIX security 19), pages 267--284, 2019

2019
[14]

Quantifying memorization across neural language models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2022

2022
[15]

Extracting training data from diffusion models

Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX security symposium (USENIX Security 23), pages 5253--5270, 2023

2023
[16]

Reading Wikipedia to Answer Open-Domain Questions

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017

work page Pith review arXiv 2017
[17]

Cheng, W

Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026

work page arXiv 2026
[18]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

2023
[19]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Autoregressive entity retrieval.arXiv preprint arXiv:2010.00904, 2020

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. arXiv preprint arXiv:2010.00904, 2020

work page arXiv 2010
[21]

The llama 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv--2407, 2024

2024
[22]

Dsdm: Model-aware dataset selection with datamodels

Logan Engstrom, Axel Feldmann, and Aleksander Madry. Dsdm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926, 2024

work page arXiv 2024
[23]

Doge: Domain reweighting with generalization estimation.arXiv preprint arXiv:2310.15393, 2023

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. arXiv preprint arXiv:2310.15393, 2023

work page arXiv 2023
[24]

Does learning require memorization? a short tale about a long tail

Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 954--959, 2020

2020
[25]

What neural networks memorize and why: Discovering the long tail via influence estimation

Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33: 0 2881--2891, 2020

2020
[26]

Trade-offs in data memorization via strong data processing inequalities

Vitaly Feldman, Guy Kornowski, and Xin Lyu. Trade-offs in data memorization via strong data processing inequalities. arXiv preprint arXiv:2506.01855, 2025

work page arXiv 2025
[27]

arXiv preprint arXiv:2004.07202 , year=

Thibault F \'e vry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. Entities as experts: Sparse memory access with entity supervision. arXiv preprint arXiv:2004.07202, 2020

work page arXiv 2004
[28]

Coercing LLMs to do and reveal (almost) anything

Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing llms to do and reveal (almost) anything. arXiv preprint arXiv:2402.14020, 2024

work page arXiv 2024
[29]

Task-adaptive pretrained language models via clustered-importance sampling

David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. arXiv preprint arXiv:2410.03735, 2024

work page arXiv 2024
[30]

Olmo: Accelerating the science of language models

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15789--15809, 2024

2024
[31]

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Xinran Gu, Kaifeng Lyu, Jiazheng Li, and Jingzhao Zhang. Data mixing can induce phase transitions in knowledge acquisition. arXiv preprint arXiv:2505.18091, 2025

work page internal anchor Pith review arXiv 2025
[32]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[33]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Few-shot learning with retrieval augmented language models,

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 1 0 (2): 0 4, 2022

work page arXiv 2022
[35]

Mixture of parrots: Experts improve memorization more than reasoning

Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M Kakade, and Eran Malach. Mixture of parrots: Experts improve memorization more than reasoning. arXiv preprint arXiv:2410.19034, 2024

work page arXiv 2024
[36]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 0 423--438, 2020

Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 0 423--438, 2020

2020
[38]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

work page Pith review arXiv 2025
[39]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

work page internal anchor Pith review arXiv 2017
[40]

Prismatic synthesis: Gradient-based data diversification boosts generalization in llm reasoning

Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Prismatic synthesis: Gradient-based data diversification boosts generalization in llm reasoning. arXiv preprint arXiv:2505.20161, 2025

work page arXiv 2025
[41]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Deduplicating training data mitigates privacy risks in language models

Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697--10707. PMLR, 2022

2022
[43]

Large language models struggle to learn long-tail knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International conference on machine learning, pages 15696--15707. PMLR, 2023

2023
[44]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769--6781, 2020

2020
[45]

Generalization through memo- rization: Nearest neighbor language models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019

work page arXiv 1911
[46]

Three approaches to the quantitative definition ofinformation’

Andrei N Kolmogorov. Three approaches to the quantitative definition ofinformation’. Problems of information transmission, 1 0 (1): 0 1--7, 1965

1965
[47]

Natural questions: a benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7: 0 453--466, 2019

2019
[48]

Deduplicating training data makes language models better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424--8445, 2022

2022
[49]

Tic-lm: A web-scale benchmark for time-continual llm pretraining

Jeffrey Li, Mohammadreza Armandpour, Seyed Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, et al. Tic-lm: A web-scale benchmark for time-continual llm pretraining. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

2025
[50]

Halueval: A large-scale hallucination evaluation benchmark for large language models.arXiv preprint arXiv:2305.11747, 2023

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. arXiv preprint arXiv:2305.11747, 2023

work page arXiv 2023
[51]

From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tec...

2024
[52]

Not all tokens are what you need for pretraining

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen, et al. Not all tokens are what you need for pretraining. Advances in Neural Information Processing Systems, 37: 0 29029--29063, 2024

2024
[53]

On a measure of the information provided by an experiment

Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27 0 (4): 0 986--1005, 1956

1956
[54]

What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023

work page arXiv 2023
[55]

Analyzing leakage of personally identifiable information in language models

Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-B \'e guelin. Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 346--363. IEEE, 2023

2023
[56]

Rephrasing the web: A recipe for compute and data-efficient language modeling

Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044--14072, 2024

2024
[57]

When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802--9822, 2023

2023
[58]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35: 0 17359--17372, 2022

2022
[59]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

2018
[60]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076--12100, 2023 a

2023
[61]

Nonparametric masked language modeling

Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, and Luke Zettlemoyer. Nonparametric masked language modeling. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2097--2118, 2023 b

2023
[62]

o ren Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt H \

S \"o ren Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt H \"o ltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630--15649. PMLR, 2022

2022
[63]

arXiv preprint arXiv:2505.24832 , year =

John X Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G Edward Suh, Alexander M Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? arXiv preprint arXiv:2505.24832, 2025

work page arXiv 2025
[64]

arxiv-papers

NICK007X. arxiv-papers. huggingface.co/datasets/nick007x/arxiv-papers, 2025. URL huggingface.co/datasets/nick007x/arxiv-papers. Accessed: Dec 9 2025

2025
[65]

Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics. arXiv preprint arXiv:2104.13346, 2021

work page arXiv 2021
[66]

Understanding llm behaviors via compression: Data generation, knowledge acquisition and scaling laws

Zhixuan Pan, Shaowen Wang, and Jian Li. Understanding llm behaviors via compression: Data generation, knowledge acquisition and scaling laws. arXiv preprint arXiv:2504.09597, 2025

work page arXiv 2025
[67]

Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022
[68]

Fabio Petroni, Tim Rockt \"a schel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2463--2473, 2019

2019
[69]

Pretraining with hierarchical memories: separating long-tail and common knowledge

Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, and Oncel Tuzel. Pretraining with hierarchical memories: separating long-tail and common knowledge. arXiv preprint arXiv:2510.02375, 2025

work page arXiv 2025
[70]

How does generative retrieval scale to millions of passages? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1305--1321, 2023

Ronak Pradeep, Kai Hui, Jai Gupta, Adam Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Tran. How does generative retrieval scale to millions of passages? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1305--1321, 2023

2023
[71]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

2019
[72]

Shaping capabilities with token-level data filtering

Neil Rathi and Alec Radford. Shaping capabilities with token-level data filtering. arXiv preprint arXiv:2601.21571, 2026

work page arXiv 2026
[73]

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020

work page internal anchor Pith review arXiv 2002
[74]

How to train data-efficient llms.arXiv preprint arXiv:2402.09668, 2024

Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms. arXiv preprint arXiv:2402.09668, 2024

work page arXiv 2024
[75]

Upweighting easy samples in fine-tuning mitigates forgetting

Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, and Sujay Sanghavi. Upweighting easy samples in fine-tuning mitigates forgetting. arXiv preprint arXiv:2502.02797, 2025

work page arXiv 2025
[76]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36: 0 68539--68551, 2023

2023
[77]

Rethinking llm memorization through the lens of adversarial compression

Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Lipton, and J Zico Kolter. Rethinking llm memorization through the lens of adversarial compression. Advances in Neural Information Processing Systems, 37: 0 56244--56267, 2024

2024
[78]

Beyond the reported cutoff: Where large language models fall short on financial knowledge

Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, and Sudheer Chava. Beyond the reported cutoff: Where large language models fall short on financial knowledge. arXiv preprint arXiv:2504.00042, 2025

work page arXiv 2025
[79]

Gptzero finds 100 new hallucinations in neurips 2025 accepted papers, January 2026

Nazar Shmatko, Alex Adam, and Paul Esau. Gptzero finds 100 new hallucinations in neurips 2025 accepted papers, January 2026. URL https://gptzero.me/news/neurips/. Accessed: 2026-01-26

2025
[80]

Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2459--2475, 2025

2025

Showing first 80 references.