Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3
The pith
Loss-based pruning of the training data lets smaller language models memorize more facts than they would if trained on the full dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fact accuracy falls below the capacity limit whenever the information contained in the training facts exceeds model capacity, and the gap widens under skewed frequency distributions. Simple loss-based selection that limits the total number of facts while equalizing their occurrence frequencies raises accuracy to the capacity limit. On a real Wikipedia corpus, this selection enables a GPT2-Small model to memorize 1.3× more entity facts than baseline training on the full dataset, matching the fact performance of a 1.3B-parameter model trained without pruning.
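Read information-theoretically, the claim has a simple shape. A hedged formalization in symbols of our own choosing (the paper gives no explicit equation for this, per the referee's minor comment below): with model capacity C bits, F distinct facts in the training data, and entropy H bits per fact,

```latex
% Hedged sketch, not the paper's notation:
%   C = model capacity in bits (empirically roughly constant bits/parameter),
%   F = number of distinct facts expressed in the training data,
%   H = entropy per fact in bits.
\[
  \text{fact accuracy} \;\lesssim\; \min\!\left(1,\ \frac{C}{F\,H}\right).
\]
% Accuracy can sit at the ceiling only while F*H <= C; once the total
% information F*H exceeds C, accuracy drops below the capacity limit,
% and a power-law frequency skew wastes capacity on repeated frequent
% facts, pushing accuracy lower still.
```

Under this reading, pruning to a subset of F' facts with F'·H ≤ C, at roughly equal frequency, is exactly what the selection scheme below is meant to achieve.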
What carries the argument
Loss-based data selection that limits the number of facts in the training set and flattens their frequency distribution
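Neither the pith nor the abstract spells out the selection rule, so here is a minimal sketch of one plausible reading, assuming selection keeps examples whose per-example loss falls inside a band: very low loss flags redundant mentions of already-memorized frequent facts (dropping them flattens the frequency distribution), while very high loss flags facts unlikely to fit the capacity budget (dropping them limits the total fact count). The function name and both thresholds are illustrative, not the paper's.

```python
import numpy as np

def loss_band_prune(examples, losses, tau_lo, tau_hi):
    """Illustrative loss-only pruning (a reading of the method, not the
    paper's exact rule).

    examples : training examples
    losses   : per-example loss from a reference training pass
    tau_lo   : below this, the example is a redundant mention of an
               already-memorized (frequent) fact -> drop it to flatten
               the frequency distribution
    tau_hi   : above this, the example expresses a fact too rare or
               hard to fit the capacity budget -> drop it to cap the
               total number of facts
    """
    losses = np.asarray(losses, dtype=float)
    keep = (losses >= tau_lo) & (losses <= tau_hi)
    return [ex for ex, kept in zip(examples, keep) if kept]
```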
If this is right
- Fact accuracy can be driven to the model's theoretical capacity limit when training data entropy is high.
- A 110M-parameter model can store 1.3× more entity facts than it does under standard full-dataset training.
- Training on the pruned subset matches the factual recall of a model ten times larger trained on the complete corpus.
- The same selection mitigates the penalty caused by power-law skew in natural fact distributions.
Where Pith is reading between the lines
- The method could reduce the compute needed to reach a given level of factual reliability by training smaller models on curated subsets.
- Much of the data in typical pretraining corpora may be redundant for the purpose of fact storage.
- Similar loss-driven pruning might be tested during continued pretraining or on non-text modalities that contain discrete facts.
Load-bearing premise
Loss-based selection can identify and remove only excess or redundant facts without discarding information the model needs for its overall capability or generalization.
What would settle it
Train two models from scratch on the same Wikipedia corpus, one on the loss-pruned subset and one on the full set, then compare their accuracy on a large held-out set of entity facts. If the pruned model does not exceed the full-data model, or fails to reach the capacity limit seen on semi-synthetic data, the central claim is false.
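A minimal sketch of the probe such a test needs, assuming held-out facts are stored as (subject, relation, object) triples and a fact counts as memorized when greedy completion reproduces the object; `model.complete` is an assumed single-string completion interface, not anything from the paper:

```python
def fact_accuracy(model, facts):
    """Fraction of held-out (subject, relation, object) triples that the
    model reproduces under greedy cloze completion."""
    hits = 0
    for subject, relation, obj in facts:
        completion = model.complete(f"{subject} {relation}").strip()
        hits += completion.startswith(obj)  # exact-prefix match on object
    return hits / len(facts)
```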
read the original abstract
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fact memorization in LLMs is suboptimal when training data contains more information than model capacity or has skewed (power-law) fact frequencies. It proposes loss-based data selection to prune excess facts and flatten frequencies, achieving capacity-limit accuracy on semi-synthetic high-entropy fact datasets and, on annotated Wikipedia pretraining, enabling a 110M-parameter GPT-2 Small model to memorize 1.3× more entity facts than standard training—matching a 1.3B model trained on the full corpus.
Significance. If the central empirical result holds after addressing coverage and control issues, the work offers a practical, low-cost way to improve parameter efficiency for factual knowledge without scaling model size, with potential implications for pretraining data curation in knowledge-intensive domains. The information-theoretic framing of capacity limits and frequency skew is a useful lens, though the contribution is primarily empirical rather than theoretical.
major comments (3)
- [§4.2] §4.2 (Wikipedia pretraining experiments): the 1.3× memorization gain and parity with the 1.3B model are reported without any post-pruning measurement of unique entity-fact coverage or retention rate. Because selection uses only per-example training loss, it is possible for singleton mentions of rare facts to be pruned, making the effective fact inventory smaller than the baseline; without this coverage statistic the comparison is not load-bearing for the claim that pruning improves capacity utilization rather than simply reducing the target set.
- [§3] §3 (data selection method): the loss-based pruning procedure lacks reported ablations on the exact loss threshold, number of epochs used to compute the loss, or sensitivity to random seeds. The abstract and results mention “exact selection thresholds” only in passing; without these controls it is unclear whether the reported gains are robust or depend on particular hyper-parameter choices that could confound the frequency-flattening interpretation.
- [§4.1] §4.1 (semi-synthetic experiments): while capacity-limit accuracy is claimed, the paper does not report statistical significance tests or variance across multiple random seeds for the fact-accuracy metric, nor does it ablate whether the improvement persists when the same number of tokens is retained but facts are not explicitly deduplicated. This leaves open the possibility that gains arise from reduced redundancy rather than the proposed information-theoretic mechanism.
minor comments (2)
- [§2] Notation for “fact accuracy” and “capacity limit” is introduced in the abstract and §2 but never given an explicit equation; a short formal definition would improve reproducibility.
- [Figures 3-5] Figure captions and axis labels in the experimental plots should explicitly state the number of unique facts and total tokens in each condition to allow direct comparison of effective data volume.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each of the major comments below and have updated the manuscript accordingly to strengthen the presentation of our results.
read point-by-point responses
Referee: [§4.2] §4.2 (Wikipedia pretraining experiments): the 1.3× memorization gain and parity with the 1.3B model are reported without any post-pruning measurement of unique entity-fact coverage or retention rate. Because selection uses only per-example training loss, it is possible for singleton mentions of rare facts to be pruned, making the effective fact inventory smaller than the baseline; without this coverage statistic the comparison is not load-bearing for the claim that pruning improves capacity utilization rather than simply reducing the target set.
Authors: We agree that post-pruning coverage statistics are necessary to rule out the possibility that gains stem from a reduced fact inventory. In the revised manuscript, we have added a new analysis in §4.2 that measures unique entity-fact retention rates and coverage after pruning. This shows that our loss-based selection primarily removes redundant mentions of frequent facts while preserving the large majority of unique facts, supporting the interpretation that the observed improvements reflect better capacity utilization. revision: yes
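The retention statistic promised here could be computed along these lines; a sketch, assuming each example carries a fact annotation that the loss-only selection itself never sees:

```python
from collections import Counter

def fact_retention(fact_ids_full, fact_ids_pruned):
    """Unique-fact retention after pruning, plus per-fact mention counts
    for checking that the surviving frequency distribution is flatter.

    fact_ids_full   : fact id of every example in the full corpus
    fact_ids_pruned : fact id of every example that survived pruning
    """
    unique_full = set(fact_ids_full)
    unique_kept = set(fact_ids_pruned)
    retention = len(unique_kept & unique_full) / len(unique_full)
    mention_counts = Counter(fact_ids_pruned)  # flat <=> low dispersion
    return retention, mention_counts
```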
Referee: [§3] §3 (data selection method): the loss-based pruning procedure lacks reported ablations on the exact loss threshold, number of epochs used to compute the loss, or sensitivity to random seeds. The abstract and results mention “exact selection thresholds” only in passing; without these controls it is unclear whether the reported gains are robust or depend on particular hyper-parameter choices that could confound the frequency-flattening interpretation.
Authors: We thank the referee for noting the need for robustness checks. The revised manuscript includes new ablations (Appendix C) that vary the loss threshold, the number of epochs used to compute per-example loss, and results across multiple random seeds. These experiments confirm that the memorization gains and frequency-flattening effect remain consistent across reasonable choices of these hyperparameters, and we have clarified the exact thresholds employed in the main experiments. revision: yes
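A sketch of what such a robustness grid could look like, reusing the loss-band reading from the sketch above; `compute_losses`, the band ranges, and the returned subset sizes are assumptions, not Appendix C's actual protocol:

```python
from itertools import product

def ablation_grid(examples, compute_losses, bands, epoch_counts, seeds):
    """Sweep pruning thresholds, loss-estimation epochs, and seeds.

    compute_losses(examples, epochs, seed) -> per-example losses
    bands : iterable of (tau_lo, tau_hi) threshold pairs
    Returns the pruned-subset size per configuration, a cheap first
    check that the selection is stable across hyperparameters.
    """
    sizes = {}
    for epochs, (tau_lo, tau_hi), seed in product(epoch_counts, bands, seeds):
        losses = compute_losses(examples, epochs=epochs, seed=seed)
        kept = [ex for ex, l in zip(examples, losses) if tau_lo <= l <= tau_hi]
        sizes[(epochs, tau_lo, tau_hi, seed)] = len(kept)
    return sizes
```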
Referee: [§4.1] §4.1 (semi-synthetic experiments): while capacity-limit accuracy is claimed, the paper does not report statistical significance tests or variance across multiple random seeds for the fact-accuracy metric, nor does it ablate whether the improvement persists when the same number of tokens is retained but facts are not explicitly deduplicated. This leaves open the possibility that gains arise from reduced redundancy rather than the proposed information-theoretic mechanism.
Authors: We acknowledge that additional statistical controls and ablations would strengthen the semi-synthetic results. The revised version adds error bars showing variance across multiple random seeds for the fact-accuracy metric, along with statistical significance tests. We also include a control ablation that retains the same number of tokens without our explicit deduplication step; the results indicate that our pruning method continues to outperform this baseline, consistent with the information-theoretic mechanism rather than redundancy reduction alone. revision: yes
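The promised significance test could be as simple as a permutation test over per-seed accuracies; a sketch, not the paper's procedure:

```python
import random
import statistics

def permutation_pvalue(acc_pruned, acc_full, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean fact
    accuracy between pruned-data and full-data runs across seeds."""
    rng = random.Random(seed)
    observed = statistics.mean(acc_pruned) - statistics.mean(acc_full)
    pooled = list(acc_pruned) + list(acc_full)
    k = len(acc_pruned)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:k]) - statistics.mean(pooled[k:])
        extreme += abs(diff) >= abs(observed)
    return extreme / n_resamples
```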
Circularity Check
No circularity: results are empirical measurements of post-pruning fact accuracy
full rationale
The paper's core claims rest on experimental comparisons: loss-based selection on semi-synthetic high-entropy facts and on an annotated Wikipedia corpus, measuring how many entity facts a GPT-2 Small model memorizes versus a baseline and versus a larger model. The information-theoretic framing (fact accuracy suboptimal when data entropy exceeds capacity, worsened by power-law skew) supplies motivation and interpretation but does not contain equations that define the observed 1.3× gain or the capacity-limit achievement as a direct algebraic consequence of the selection rule itself. No fitted parameter is renamed as a prediction, no self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled in. The skeptic concern about possible loss of unique facts is a validity question about the heuristic, not a reduction of the reported numbers to the paper's own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fact accuracy becomes suboptimal whenever the total information in training facts exceeds model capacity, and this effect is worsened by power-law frequency skew.