Hierarchical Memorization in Large Language Models: Evidence from Citation Generation
Pith reviewed 2026-05-17 23:08 UTC · model grok-4.3
The pith
Large language models memorize citation details in layers, recalling titles and authors before venues or years as training repetitions increase.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memorization of bibliographic records inside LLMs is not an all-or-nothing event but a graduated, hierarchically layered process. Factual accuracy scales log-linearly with citation count, with an inflection near 90 citations and near-verbatim reproduction after roughly 1,200 citations. Within any single record, titles and first authors are recovered first, venues and numeric fields demand substantially more redundancy, and publication years remain essentially unlearned. Records that share similar titles or authors can still interfere with one another even when each is individually well represented.
What carries the argument
Citation count used as a proxy for training-data redundancy, which orders the acquisition thresholds for different metadata fields inside one bibliographic entry.
If this is right
- Generated citations become measurably more accurate as the underlying paper's citation count rises, with accuracy scaling log-linearly in that count.
- Near-verbatim reproduction of a full citation record occurs only after the record exceeds roughly 1,200 citations.
- Overlapping titles or authors can still produce conflations even for records that individually exceed the saturation threshold.
- Publication years stay poorly reproduced regardless of how high the citation count becomes.
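The log-linear scaling in the first bullet can be illustrated with a minimal sketch. It assumes the semi-log form accuracy = a + b * log10(citations), which is one reading of "log-linear" and not a specification taken from the paper. The function names and the four data points are hypothetical and chosen only so the fit is easy to check by hand.

```python
import math

def fit_log_linear(citations, accuracy):
    """Ordinary least squares for accuracy = a + b * log10(citations)."""
    xs = [math.log10(c) for c in citations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, citation_count):
    """Predicted accuracy, clipped to the valid range [0, 1]."""
    raw = intercept + slope * math.log10(citation_count)
    return min(1.0, max(0.0, raw))

# Hypothetical observations (not from the paper): each tenfold increase in
# citation count adds the same fixed amount of accuracy.
citations = [10, 100, 1000, 10000]
accuracy = [0.2, 0.4, 0.6, 0.8]

intercept, slope = fit_log_linear(citations, accuracy)
print(round(intercept, 6), round(slope, 6))
print(predict(intercept, slope, 90), predict(intercept, slope, 1200))
```

A pure log-linear fit of this kind has no built-in inflection or plateau, so the reported thresholds near 90 and 1,200 citations would have to show up as systematic deviations from the fitted line.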
Where Pith is reading between the lines
- The same layered pattern may govern factual recall in domains other than citations whenever training data contains repeated but unevenly detailed statements.
- Targeted increases in redundancy for numeric fields or dates during continued pretraining could reduce specific classes of hallucination without raising overall data volume.
- Corpus curators could prioritize balanced coverage of weaker fields such as years rather than simply adding more high-count examples.
Load-bearing premise
Citation count serves as a reliable proxy for the redundancy of a bibliographic record in the pretraining corpus.
What would settle it
Direct measurement of actual training-data frequency for specific papers showing that accuracy no longer tracks citation count once true redundancy is controlled.
Original abstract
Large language models (LLMs) generate fluent text across a wide range of tasks, but the fabrication of non-existent academic citations remains a critical and well-documented failure mode. Building on prior work that frames hallucination and verbatim memorization as outcomes of the same probabilistic process, this study uses citation count as a proxy for training data redundancy and asks how this redundancy is internally structured within a single bibliographic record. Using GPT-4.1, we generated and manually verified 100 citations across twenty computer-science domains, measuring factual fidelity via cosine similarity against authentic metadata. We find that (i) factual accuracy varies substantially across domains and scales log-linearly with citation count, (ii) the model crosses two empirically identifiable thresholds; an inflection around 90 citations and a saturation point near 1,200 citations beyond which records are reproduced nearly verbatim, (iii) memorization is hierarchical, with titles and first authors recalled earliest while venues and numeric fields require far greater redundancy and publication years remain essentially unlearned, and (iv) even highly cited records can be conflated when their titles and authors overlap, an effect interpretable as spurious-attractor interference. Memorization in LLMs is therefore not a binary on/off state but a graduated, hierarchically layered phenomenon shaped by the uneven distribution of knowledge in the pretraining corpus.
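The abstract does not say how metadata strings were turned into vectors before cosine similarity was computed. The sketch below assumes plain token-count vectors, a swapped-in stand-in rather than the authors' pipeline, and scores each field separately so that field-level differences are visible. The function names and the `generated` record are hypothetical. The `reference` record is the well-known "Attention is all you need" entry.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between token-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    sq_a = sum(v * v for v in va.values())
    sq_b = sum(v * v for v in vb.values())
    if sq_a == 0 or sq_b == 0:
        return 0.0  # an empty field cannot match anything
    return dot / math.sqrt(sq_a * sq_b)

def field_scores(generated, reference):
    """Per-field similarity of a generated record against ground truth."""
    return {
        field: cosine_similarity(generated.get(field, ""), truth)
        for field, truth in reference.items()
    }

reference = {
    "title": "Attention is all you need",
    "first_author": "Ashish Vaswani",
    "venue": "Advances in neural information processing systems",
    "year": "2017",
}
# Hypothetical model output: title and author recalled, venue and year wrong.
generated = {
    "title": "Attention Is All You Need",
    "first_author": "Ashish Vaswani",
    "venue": "International Conference on Machine Learning",
    "year": "2018",
}

for field, score in field_scores(generated, reference).items():
    print(field, round(score, 3))
```

Token-count vectors treat a wrong year as a complete miss, whereas embedding-based similarity may still score "2018" close to "2017". The choice of vectorization can therefore change how harshly numeric fields are judged.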
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that memorization in LLMs is not binary but a graduated, hierarchically layered process. Using GPT-4.1 to generate 100 citations across 20 computer-science domains, manually verifying them, and measuring factual fidelity via cosine similarity to authentic metadata, the authors treat citation count as a proxy for pretraining-data redundancy. They report log-linear scaling of accuracy with citation count, two thresholds (inflection near 90 citations, saturation near 1,200), a clear ordering in which titles and first authors are recalled earliest while venues, numeric fields, and especially publication years require greater redundancy, and spurious-attractor interference when titles and authors overlap.
Significance. If the central empirical patterns hold, the work supplies concrete evidence that LLM knowledge acquisition tracks uneven redundancy gradients inside the training corpus, moving the field beyond binary hallucination/memorization dichotomies. The manual verification of 100 citations and the quantitative cosine-similarity metric against ground-truth metadata constitute a clear methodological strength and support falsifiable claims about field-specific thresholds.
major comments (3)
- [Methods] Methods section: the justification for treating raw citation count as a reliable proxy for token-level redundancy of specific bibliographic fields (exact title strings versus year tokens, for example) is not provided; paywalled sources, citation contexts that omit full metadata, and domain-specific scraping rates could decouple citation count from the relevant duplication statistics, directly undermining the interpretation of the observed hierarchy and the two thresholds.
- [Results] Results section on scaling and thresholds: the identification of the inflection at ~90 citations and saturation at ~1,200 citations lacks a reported statistical procedure (change-point analysis, piecewise regression, or bootstrap confidence intervals), so it is unclear whether these are robustly estimated or post-hoc visual choices; this choice is load-bearing for the log-linear scaling claim and the hierarchical interpretation.
- [Results] Results on field ordering: the claim that titles and first authors are recalled at lower redundancy while venues and numeric fields require higher counts rests on the untested assumption that each field appears with equal frequency in the contexts that actually reach the model; without auxiliary measurements of field-specific occurrence rates in the training distribution, the hierarchy could arise from prompt formatting or generation heuristics instead.
minor comments (2)
- [Abstract] Abstract and Results: clarify whether the reported log-linear relationship is obtained by regressing accuracy on log(citation count) or by fitting a linear model to log-transformed variables; the distinction affects how the thresholds are interpreted.
- [Figures] Figure captions and legends: ensure that any plots of accuracy versus citation count explicitly mark the 90- and 1,200-citation thresholds and report the number of points in each bin.
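The two readings of "log-linear" raised in the first minor comment can be written out explicitly. The symbols below are illustrative and not taken from the paper: c is the citation count, acc(c) the measured accuracy, and the Greek letters are fitted coefficients.

```latex
\begin{align}
\text{(A) semi-log:}\quad & \mathrm{acc}(c) = \alpha + \beta \log c \\
\text{(B) log-log:}\quad  & \log \mathrm{acc}(c) = \alpha' + \beta' \log c
  \;\Longleftrightarrow\; \mathrm{acc}(c) = e^{\alpha'} c^{\beta'}
\end{align}
```

Under (A), multiplying the citation count by a factor k adds a fixed amount, beta log k, to accuracy. Under (B), it multiplies accuracy by k raised to beta-prime. A threshold such as "saturation near 1,200 citations" therefore corresponds to a different departure from the fitted curve in each case.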
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating revisions where the manuscript will be updated to improve rigor and transparency.
Point-by-point responses
- Referee: [Methods] Methods section: the justification for treating raw citation count as a reliable proxy for token-level redundancy of specific bibliographic fields (exact title strings versus year tokens, for example) is not provided; paywalled sources, citation contexts that omit full metadata, and domain-specific scraping rates could decouple citation count from the relevant duplication statistics, directly undermining the interpretation of the observed hierarchy and the two thresholds.
Authors: We agree that citation count is an indirect proxy and does not directly capture token-level redundancy for individual fields, with potential decoupling from paywalled sources or incomplete contexts. In the revised manuscript we will add an explicit limitations paragraph in the Methods section discussing these factors and their implications for the hierarchy and thresholds, while retaining citation count as a practical proxy given the inaccessibility of full pretraining corpora. revision: yes
- Referee: [Results] Results section on scaling and thresholds: the identification of the inflection at ~90 citations and saturation at ~1,200 citations lacks a reported statistical procedure (change-point analysis, piecewise regression, or bootstrap confidence intervals), so it is unclear whether these are robustly estimated or post-hoc visual choices; this choice is load-bearing for the log-linear scaling claim and the hierarchical interpretation.
Authors: The thresholds were identified via visual inspection of accuracy curves for slope changes and plateauing. We acknowledge the absence of formal statistical validation. The revised Results section will incorporate change-point detection analysis along with bootstrap confidence intervals to substantiate the reported inflection and saturation points. revision: yes
- Referee: [Results] Results on field ordering: the claim that titles and first authors are recalled at lower redundancy while venues and numeric fields require higher counts rests on the untested assumption that each field appears with equal frequency in the contexts that actually reach the model; without auxiliary measurements of field-specific occurrence rates in the training distribution, the hierarchy could arise from prompt formatting or generation heuristics instead.
Authors: Direct auxiliary measurements of field-specific occurrence rates are not possible without access to the closed training corpus. Standardized prompts were used across all generations to reduce formatting confounds, and the hierarchy is consistent across twenty domains. We will add discussion of alternative explanations including generation heuristics in the revision while maintaining that the redundancy interpretation best accounts for the observed patterns. revision: partial
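The second response promises change-point detection with bootstrap confidence intervals. One simple way to do that is sketched below, under stated assumptions: a single breakpoint, found by grid search over two-segment least-squares fits on log10(citations), with a percentile bootstrap over resampled (citation, accuracy) pairs. This is not the authors' procedure. All function names and the synthetic data are hypothetical.

```python
import math
import random

def _line_sse(xs, ys):
    """Sum of squared residuals of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all x equal (possible in a resample): fit a constant
        return sum((y - my) ** 2 for y in ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def find_breakpoint(citations, accuracy, min_points=3):
    """Citation count that best splits the data into two linear segments."""
    pairs = sorted(zip((math.log10(c) for c in citations), accuracy))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        sse = _line_sse(xs[:k], ys[:k]) + _line_sse(xs[k:], ys[k:])
        if best is None or sse < best[0]:
            # report the geometric midpoint between the neighbouring counts
            best = (sse, 10 ** ((xs[k - 1] + xs[k]) / 2))
    if best is None:
        raise ValueError("need at least 2 * min_points observations")
    return best[1]

def bootstrap_interval(citations, accuracy, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the breakpoint location."""
    rng = random.Random(seed)
    n = len(citations)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(find_breakpoint([citations[i] for i in idx],
                                         [accuracy[i] for i in idx]))
    estimates.sort()
    lo = estimates[int((alpha / 2) * (n_boot - 1))]
    hi = estimates[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi

# Synthetic data (not from the paper): flat below 100 citations, then a jump
# followed by a steady rise in log10(citations).
citations = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
accuracy = [0.1 if c < 100 else 0.3 + 0.2 * (math.log10(c) - 2) for c in citations]

print(find_breakpoint(citations, accuracy))
print(bootstrap_interval(citations, accuracy, n_boot=200))
```

A wide bootstrap interval around an estimated breakpoint would indicate that a visually chosen threshold is weakly supported by the data. Detecting both the inflection and the saturation point would require extending the search to two breakpoints.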
Circularity Check
No significant circularity; purely observational empirical study
The paper conducts an observational study by prompting GPT-4.1 to generate citations, manually verifying them, and measuring factual fidelity via cosine similarity to authentic metadata while correlating results against external citation counts. No equations, derivations, or fitted parameters are present that reduce any claimed result to a quantity defined by the inputs themselves. The hierarchical memorization pattern is reported as an empirical observation across fields at different citation-count thresholds rather than a self-referential construction. The proxy assumption linking citation count to training-data redundancy is an interpretive framing, not a definitional or self-citation loop that forces the central claims.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Citation count is a valid proxy for training-data redundancy of a bibliographic record
- domain assumption: Cosine similarity on metadata plus manual verification accurately captures factual fidelity
Forward citations
Cited by 1 Pith paper
- LLM-Metrics: Measuring Research Impact Through Large Language Model Memory
  LLM-Metrics probes memory in 17 LLMs across 549 CS papers from 2023-2024 and finds a modest Spearman correlation (rho = 0.1495) with citation counts, stronger for 2024 papers.