LLM-Metrics: Measuring Research Impact Through Large Language Model Memory
Pith reviewed 2026-05-22 05:57 UTC · model grok-4.3
The pith
LLMs remember high-impact papers better, turning their internal memory into a citation-free impact metric.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High-impact papers receive greater exposure in the academic community, this exposure enters LLM training data in textual form, and models consequently form stronger parametric memory of these papers. The evidence offered is a positive correlation between probe accuracy and citation counts in 15 of 17 models, 9 of them significant at p < 0.05.
What carries the argument
Four types of multiple-choice probes covering title recognition, author recognition, method recognition, and venue recognition, evaluated on 549 papers across 17 LLMs.
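The probe mechanism can be illustrated with a minimal sketch. The question wording, the `Probe` class, and the helper names below are hypothetical, since the paper's actual templates are not reproduced on this page, and the "model" is a stand-in function rather than a real LLM call.

```python
import random
from dataclasses import dataclass


@dataclass
class Probe:
    question: str
    options: list       # shuffled answer options
    answer_index: int   # position of the correct option


def build_probe(question, correct, distractors, rng):
    """Shuffle the correct answer in among the distractors."""
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return Probe(question, options, options.index(correct))


def probe_accuracy(probes, choose):
    """Fraction of probes for which choose(probe) returns the correct index."""
    if not probes:
        raise ValueError("need at least one probe")
    hits = sum(1 for p in probes if choose(p) == p.answer_index)
    return hits / len(probes)


rng = random.Random(0)
probes = [
    build_probe("Which title matches this abstract?", "Title A",
                ["Title B", "Title C", "Title D"], rng),
    build_probe("Who wrote this paper?", "Author A",
                ["Author B", "Author C", "Author D"], rng),
]


def oracle(probe):
    """Stand-in for an LLM call: always picks the correct option."""
    return probe.answer_index


print(probe_accuracy(probes, oracle))
```

In a real run, `choose` would wrap a model query and parse the selected option. Per-paper accuracy across the four probe types is the quantity that is later correlated with citation counts.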
If this is right
- Impact can be estimated for recent papers that have not yet accumulated citations, provided they were published before the model's training cutoff.
- Metrics avoid temporal lags and disciplinary biases that affect citation counts.
- Author-recognition probes show the strongest signal, consistent with exposure as the driver.
- Smaller models can yield stronger predictive signals than larger ones, which the authors attribute to limited capacity acting as a filter.
Where Pith is reading between the lines
- The approach could extend to non-computer-science fields if similar probes are adapted.
- Training data composition for different models might be reverse-engineered from memory patterns.
- Combining LLM memory scores with other early signals could improve short-term impact forecasting.
Load-bearing premise
Probe accuracy differences primarily reflect differential exposure of paper content in LLM training corpora rather than other factors like paper length or style.
What would settle it
Observing no correlation between probe accuracy and citations after matching papers for length, topic, and publication venue would falsify the exposure-memory link.
Abstract
Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and Matthew effects. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title recognition, author recognition, method recognition, and venue recognition, and evaluated 549 computer science papers published in 2023-2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive predictions, 9 of which were significant at p < 0.05, with an overall Spearman correlation of rho = 0.1495 and p = 0.0004 against citation counts. Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers, rho = 0.1880, whose citation counts were near zero at model-training time, reducing the plausibility of a simple reverse-causality explanation. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, model scale and predictive power were non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct, with rho = 0.1829, outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models can serve as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
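The headline statistic can be sketched in a few lines. The code below is an illustration, not the authors' pipeline: it computes Spearman's rho as the Pearson correlation of average ranks, then checks whether the reported rho and p-value are mutually consistent under the assumption that the overall correlation was computed over all 549 papers. The toy input lists are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def approx_two_sided_p(rho, n):
    """Large-sample p-value: the t statistic referred to a normal distribution."""
    t = rho * math.sqrt((n - 2) / (1 - rho ** 2))
    return math.erfc(abs(t) / math.sqrt(2))


# Toy data: probe accuracy and citation counts for four papers.
print(spearman([0.2, 0.5, 0.4, 0.9], [1, 7, 3, 40]))

# Consistency check on the reported overall figures (n = 549 is an assumption).
print(approx_two_sided_p(0.1495, 549))
```

Under that assumption the approximate p-value comes out close to the reported p = 0.0004, so the two published numbers agree with each other. The check says nothing about whether the correlation reflects memory rather than a confounder.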
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LLM-Metrics as a citation-independent research-impact measure derived from LLMs' parametric memory. The hypothesis is that high-impact papers receive greater community exposure that enters LLM training data, leading to stronger model memory measurable via multiple-choice probes on title, author, method, and venue. The study evaluates 549 computer science papers from 2023-2024 across 17 LLMs (0.5B to 72B parameters), reporting an overall Spearman rho of 0.1495 (p=0.0004) with citation counts, a stronger rho=0.1880 for 2024 papers, best performance on author probes, and non-monotonic scaling with model size.
Significance. If the mechanism is confirmed, LLM-Metrics could enable real-time, cross-disciplinary impact assessment that avoids citation lags and biases. Credit is due for the 2024-paper control that reduces reverse-causality concerns, the non-monotonic scale finding supporting selective memory, and the use of multiple probe types. The modest effect size limits immediate replacement value but suggests a useful complement; the approach is novel in directly probing LLM internals for scientometric purposes.
major comments (3)
- [Results] Results section: the central claim that probe accuracy reflects differential training-data exposure is load-bearing, yet no regression controls or matching for paper-level confounders (length, abstract complexity, topic popularity, or stylistic clarity) are reported; the observed rho=0.1495 (and rho=0.1880 for 2024 papers) could arise from these factors rather than memory of specific paper text.
- [Experimental setup] Experimental setup: validation relies on correlation with citation counts, the very metric the approach seeks to supplement, which creates a circularity risk; while the 2024 control and author-probe strength provide partial independent grounding, the manuscript lacks baseline comparisons (e.g., against random probes, length-matched controls, or non-LLM heuristics) or error analysis to isolate the proposed mechanism.
- [Discussion] Discussion: the non-monotonic scale result (e.g., Llama-3.2-3B outperforming larger models) is interesting but does not distinguish exposure-driven memory from topic-salience effects; a concrete test such as topic-controlled subsets or human-rated salience covariates is needed to support the selective-memory interpretation.
minor comments (3)
- [Abstract] Abstract: the statement that '15 of 17 models produced positive predictions, 9 of which were significant' would benefit from a supplementary table listing per-model rho values and p-values for transparency.
- [Methods] Methods: probe question templates and exact multiple-choice options should be reproduced in an appendix to enable replication and assessment of question difficulty.
- Figure clarity: any plots of rho versus model size or citation bins should include error bars or confidence intervals to convey variability across the 549 papers.
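One inexpensive way to attach an interval to a reported rho is the Fisher z-transform. The sketch below uses the plain 1/(n - 3) variance, which is only an approximation for Spearman correlations, and assumes n = 549 for the overall figure. The function name `fisher_ci` is illustrative.

```python
import math


def fisher_ci(rho, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z."""
    if n <= 3 or not -1 < rho < 1:
        raise ValueError("need n > 3 and -1 < rho < 1")
    z = math.atanh(rho)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


low, high = fisher_ci(0.1495, 549)
print(round(low, 3), round(high, 3))
```

Under these assumptions the interval excludes zero but remains wide relative to the point estimate, which is the kind of variability the requested error bars would make visible. A bootstrap over papers would be a reasonable alternative when ties are heavy.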
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us identify areas to strengthen the manuscript. We address each major comment below and indicate the revisions we will make in the next version.
- Referee: [Results] Results section: the central claim that probe accuracy reflects differential training-data exposure is load-bearing, yet no regression controls or matching for paper-level confounders (length, abstract complexity, topic popularity, or stylistic clarity) are reported; the observed rho=0.1495 (and rho=0.1880 for 2024 papers) could arise from these factors rather than memory of specific paper text.
  Authors: We agree that unmeasured paper-level factors could contribute to the observed correlations and that explicit controls would strengthen the central claim. In the revised manuscript we will add ordinary least-squares regressions of probe accuracy on citation count while controlling for paper length (token count), abstract complexity (Flesch-Kincaid score and sentence length), and topic popularity (one-hot encoding of primary arXiv category plus venue). We will also report partial Spearman correlations after residualizing out these covariates. These additions directly test whether the signal survives after accounting for the listed confounders.
  Revision: yes
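For a single covariate, the partial correlation promised above has a closed form that is equivalent to correlating the residuals of two simple linear regressions on that covariate. The sketch below applies it to rank correlations. The covariate correlations in the example are invented for illustration, and several covariates at once would need a full regression instead.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z.

    Feed in Spearman correlations to obtain a partial Spearman correlation.
    """
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("covariate is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


# x = probe accuracy, y = citation count, z = paper length (token count).
# r_xy is the reported overall rho; r_xz and r_yz are hypothetical values.
print(partial_corr(0.1495, 0.30, 0.20))
```

If length correlated with both accuracy and citations as strongly as in this made-up example, a noticeable share of the raw rho would disappear, which is why the control matters for the exposure claim.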
- Referee: [Experimental setup] Experimental setup: validation relies on correlation with citation counts, the very metric the approach seeks to supplement, which creates a circularity risk; while the 2024 control and author-probe strength provide partial independent grounding, the manuscript lacks baseline comparisons (e.g., against random probes, length-matched controls, or non-LLM heuristics) or error analysis to isolate the proposed mechanism.
  Authors: We acknowledge the value of additional baselines. We will insert a new subsection that reports (i) probe accuracy against random-choice baselines (uniform and frequency-matched), (ii) results on length-matched control texts drawn from non-academic sources, and (iii) a brief error analysis of the 50 papers with largest residuals. The 2024-paper subset and the superior performance of author probes already provide some separation from citation counts; the new baselines will further isolate the contribution of parametric memory. Full non-LLM heuristics (e.g., simple n-gram overlap with Google Scholar snippets) lie outside the current scope but will be noted as future work.
  Revision: partial
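The two random-choice baselines can be computed analytically. In the sketch below, "frequency-matched" is read as a guesser that picks each option position with the same frequency at which that position holds the correct answer. That reading is an assumption, as the rebuttal does not define the term. The binomial tail gives a one-sided test of whether an observed hit count beats guessing.

```python
import math
from collections import Counter


def uniform_baseline(n_options):
    """Expected accuracy when guessing uniformly among n_options."""
    return 1.0 / n_options


def frequency_matched_baseline(answer_positions):
    """Expected accuracy of a guesser that mirrors the answer-position frequencies."""
    n = len(answer_positions)
    counts = Counter(answer_positions)
    return sum((c / n) ** 2 for c in counts.values())


def binomial_tail(hits, trials, chance):
    """P(X >= hits) for X ~ Binomial(trials, chance)."""
    return sum(
        math.comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Invented example: correct-answer positions for eight four-option probes.
positions = [0, 1, 1, 2, 3, 1, 0, 1]
print(uniform_baseline(4))
print(frequency_matched_baseline(positions))
print(binomial_tail(6, 8, uniform_baseline(4)))
```

When answer positions are unbalanced, the frequency-matched baseline sits above the uniform one, so it is the stricter reference for deciding whether a model knows anything about a paper.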
- Referee: [Discussion] Discussion: the non-monotonic scale result (e.g., Llama-3.2-3B outperforming larger models) is interesting but does not distinguish exposure-driven memory from topic-salience effects; a concrete test such as topic-controlled subsets or human-rated salience covariates is needed to support the selective-memory interpretation.
  Authors: We concur that topic salience remains a plausible alternative explanation. In revision we will add a stratified analysis that recomputes the scale–performance relationship within each of the five most frequent arXiv categories (cs.AI, cs.LG, cs.CL, cs.CV, cs.SE), thereby holding broad topic constant. We will also expand the Discussion to explicitly acknowledge that human-rated salience covariates would be a stronger test and to frame the non-monotonic finding as suggestive rather than conclusive evidence for selective memory. These steps address the concern within the limits of the existing dataset.
  Revision: partial
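The stratified analysis could look roughly like the sketch below, which groups papers by category and computes a rank correlation inside each group for one model. The row layout, the `min_size` threshold, and the toy rows are assumptions. Repeating the computation per model would give the within-category scale comparison the authors describe.

```python
import math
from collections import defaultdict


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None  # undefined when one variable is constant
    return sxy / math.sqrt(sxx * syy)


def rho_by_category(rows, min_size=3):
    """rows: iterable of (category, probe_accuracy, citation_count)."""
    groups = defaultdict(list)
    for category, accuracy, citations in rows:
        groups[category].append((accuracy, citations))
    result = {}
    for category, pairs in groups.items():
        if len(pairs) < min_size:
            continue  # too few papers for a meaningful correlation
        acc, cites = zip(*pairs)
        result[category] = spearman(acc, cites)
    return result


# Toy rows, invented purely for illustration.
rows = [
    ("cs.CL", 0.25, 0), ("cs.CL", 0.50, 3), ("cs.CL", 0.75, 12),
    ("cs.CV", 0.75, 1), ("cs.CV", 0.50, 4), ("cs.CV", 0.25, 9),
    ("cs.SE", 0.50, 2),
]
print(rho_by_category(rows))
```

Splitting 549 papers five ways leaves small strata, so per-category estimates will be noisy. A pooled correlation can also differ in sign from every within-category one, which is the pattern a topic-salience confound would produce.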
Circularity Check
No significant circularity in derivation chain
The paper hypothesizes that greater exposure leads to stronger LLM parametric memory, designs title/author/method/venue probes, computes accuracies on 549 papers, and reports empirical Spearman correlations (rho=0.1495 overall, rho=0.1880 for 2024 papers) with citation counts as supporting evidence. This correlation is an external benchmark test, not a definitional reduction, fitted parameter renamed as prediction, or self-citation chain. The 2024 control (near-zero citations at training time) and non-monotonic scale finding supply independent observations that do not reduce to the input data by construction. No load-bearing step quotes or exhibits equivalence to prior author results or ansatzes. The paper remains self-contained against the citation benchmark it uses for validation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM parametric memory strength for a paper is driven primarily by the volume of that paper's content appearing in the model's training data due to academic exposure.