
arxiv: 2605.22176 · v1 · pith:4D4ZA76Bnew · submitted 2026-05-21 · 💻 cs.AI

LLM-Metrics: Measuring Research Impact Through Large Language Model Memory

Pith reviewed 2026-05-22 05:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM parametric memory · research impact metrics · citation correlation · multiple-choice probes · academic assessment · exposure in training data

The pith

LLMs remember high-impact papers better, turning their internal memory into a citation-free impact metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can gauge research impact by how well they recall specific papers. It argues that widely read papers appear more often in training data, leading to stronger parametric memories that show up in probe tests. By creating multiple-choice questions about titles, authors, methods, and venues for recent computer science papers, the authors find positive correlations with citation counts in most models. This offers a way to assess impact in real time without waiting for citations to accumulate or facing disciplinary lags.

Core claim

High-impact papers receive greater exposure in the academic community, this exposure enters LLM training data in textual form, and models consequently form stronger parametric memory of these papers, shown by positive correlations between probe accuracy and citation counts across 15 of 17 models.

What carries the argument

Four types of multiple-choice probes covering title recognition, author recognition, method recognition, and venue recognition, evaluated on 549 papers across 17 LLMs.
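
As a rough illustration of how such probes could be assembled and scored, here is a minimal Python sketch. It is not the authors' implementation: the question wording, the use of four options per probe, and the names `Probe`, `make_probe`, `probe_accuracy`, and `answer_fn` (a stand-in for querying a model) are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Probe:
    question: str
    options: list      # answer options in shuffled order
    answer_index: int  # position of the correct option


def make_probe(question, correct, distractors, rng):
    """Mix the correct answer in among the distractors (assumed distinct)."""
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return Probe(question, options, options.index(correct))


def probe_accuracy(probes, answer_fn):
    """Fraction of probes for which answer_fn picks the correct option index."""
    hits = sum(answer_fn(p) == p.answer_index for p in probes)
    return hits / len(probes)


rng = random.Random(0)  # fixed seed so the option order is reproducible

# One hypothetical probe per type named in the paper; all strings are placeholders.
probes = [
    make_probe("Which title matches this abstract?",
               "Title A", ["Title B", "Title C", "Title D"], rng),
    make_probe("Who wrote this paper?",
               "Author A", ["Author B", "Author C", "Author D"], rng),
    make_probe("Which method does this paper propose?",
               "Method A", ["Method B", "Method C", "Method D"], rng),
    make_probe("Where was this paper published?",
               "Venue A", ["Venue B", "Venue C", "Venue D"], rng),
]

chance = 1 / len(probes[0].options)  # accuracy expected from random guessing
oracle = probe_accuracy(probes, lambda p: p.answer_index)  # always-correct answerer
```

In a real run, `answer_fn` would wrap a model call and parse the chosen option; comparing the resulting accuracy with `chance` gives a first sanity check that a model knows anything about a paper at all.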

If this is right

  • Impact assessment becomes possible in near real time for recent papers whose text reached training data before citations could accumulate.
  • Metrics avoid temporal lags and disciplinary biases that affect citation counts.
  • Author-recognition probes show the strongest signal, consistent with exposure as the driver.
  • Smaller models can yield stronger predictive signals than larger ones due to capacity limits acting as filters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to non-computer-science fields if similar probes are adapted.
  • Training data composition for different models might be reverse-engineered from memory patterns.
  • Combining LLM memory scores with other early signals could improve short-term impact forecasting.

Load-bearing premise

Probe accuracy differences primarily reflect differential exposure of paper content in LLM training corpora rather than other factors like paper length or style.

What would settle it

Observing no correlation between probe accuracy and citations after matching papers for length, topic, and publication venue would falsify the exposure-memory link.

Figures

Figures reproduced from arXiv: 2605.22176 by Danhao Zhu, Si Shen, Wenhua Zhao.

Figure 1: Front summary of LLM-Metrics. This overview condenses the paper’s three central empirical claims using only results reported in this study. Left, the 2024 cohort shows a stronger overall correlation with later citations than the 2023 cohort (ρ = 0.1880 vs. ρ = 0.0559), supporting a citation-independent exposure signal. Middle, predictive power is non-monotonic with model size: the 4–10B group has the highe…
Figure 2: Diagnostic evidence for selective memory. Left, model-level year-split analysis shows that 8 of 10 models achieve higher Spearman ρ on 2024 papers than on 2023 papers, even though 2024 papers had minimal citation accumulation at training time. Right, top model ranking and scale contrast show that Llama-3.2-3B reaches the highest correlation (ρ = 0.1829), matching the 70B model-level signal and exceeding mo…
Figure 3: Model ranking by predictive power. Horizontal bar chart showing Spearman ρ between LLM-Metrics and citation counts for all 17 models, colored by vendor. Significance levels: ∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05. The dashed vertical line marks the overall ρ = 0.1495. Model sizes (in billions of parameters) are shown in parentheses.
Figure 4: Mechanism evidence. a, Memory scores across citation bins for Llama-3.2-3B-Instruct. The line plot shows the mean memory score for each citation bin, with the shaded region indicating the 95% confidence band. Memory scores increase with citation counts (ρ = +0.1829, p < 0.001). b, Comparison of Spearman ρ for 2023 versus 2024 papers across 10 models. The predictive signal is consistently stronger for 2024 …
Figure 5: Vendor comparison and non-monotonic scaling. a, Vendor-level comparison of mean and best Spearman ρ. Meta’s LLaMA-3 family consistently outperforms other vendors. b, Scatter plot of model size versus Spearman ρ, colored by vendor. Filled circles indicate significant correlations (p < 0.05); open squares indicate non-significant. The relationship is non-monotonic: a 3B-parameter model (Llama-3.2-3B-Instruct…
Original abstract

Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and Matthew effects. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title recognition, author recognition, method recognition, and venue recognition, and evaluated 549 computer science papers published in 2023-2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive predictions, 9 of which were significant at p less than 0.05, with an overall Spearman correlation of rho = 0.1495 and p = 0.0004 against citation counts. Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers, rho = 0.1880, whose citation counts were near zero at model-training time, reducing the plausibility of a simple reverse-causality explanation. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, model scale and predictive power were non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct, with rho = 0.1829, outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models can serve as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
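
For readers who want to see what the headline statistic measures, the following is a small, dependency-free Python sketch of Spearman's rho between per-paper memory scores and citation counts. The toy numbers are invented for illustration and are not data from the paper; the function names are likewise assumptions of this example.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: probe accuracy per paper and its citation count.
memory_scores = [0.25, 0.50, 0.50, 0.75, 1.00]
citations = [0, 3, 1, 10, 40]
rho = spearman_rho(memory_scores, citations)
```

Because the statistic depends only on ranks, it is insensitive to the heavy right skew typical of citation counts, which is presumably why a rank correlation is reported rather than a linear one.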

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes LLM-Metrics as a citation-independent research-impact measure derived from LLMs' parametric memory. The hypothesis is that high-impact papers receive greater community exposure that enters LLM training data, leading to stronger model memory measurable via multiple-choice probes on title, author, method, and venue. The study evaluates 549 computer science papers from 2023-2024 across 17 LLMs (0.5B to 72B parameters), reporting an overall Spearman rho of 0.1495 (p=0.0004) with citation counts, a stronger rho=0.1880 for 2024 papers, best performance on author probes, and non-monotonic scaling with model size.

Significance. If the mechanism is confirmed, LLM-Metrics could enable real-time, cross-disciplinary impact assessment that avoids citation lags and biases. Credit is due for the 2024-paper control that reduces reverse-causality concerns, the non-monotonic scale finding supporting selective memory, and the use of multiple probe types. The modest effect size limits immediate replacement value but suggests a useful complement; the approach is novel in directly probing LLM internals for scientometric purposes.

major comments (3)
  1. [Results] Results section: the central claim that probe accuracy reflects differential training-data exposure is load-bearing, yet no regression controls or matching for paper-level confounders (length, abstract complexity, topic popularity, or stylistic clarity) are reported; the observed rho=0.1495 (and rho=0.1880 for 2024 papers) could arise from these factors rather than memory of specific paper text.
  2. [Experimental setup] Experimental setup: validation relies on correlation with citation counts—the very metric the approach seeks to supplement—creating circularity risk; while the 2024 control and author-probe strength provide partial independent grounding, the manuscript lacks baseline comparisons (e.g., against random probes, length-matched controls, or non-LLM heuristics) or error analysis to isolate the proposed mechanism.
  3. [Discussion] Discussion: the non-monotonic scale result (e.g., Llama-3.2-3B outperforming larger models) is interesting but does not distinguish exposure-driven memory from topic-salience effects, as the skeptic concern notes; a concrete test such as topic-controlled subsets or human-rated salience covariates is needed to support the selective-memory interpretation.
minor comments (3)
  1. [Abstract] Abstract: the statement that '15 of 17 models produced positive predictions, 9 of which were significant' would benefit from a supplementary table listing per-model rho values and p-values for transparency.
  2. [Methods] Methods: probe question templates and exact multiple-choice options should be reproduced in an appendix to enable replication and assessment of question difficulty.
  3. Figure clarity: any plots of rho versus model size or citation bins should include error bars or confidence intervals to convey variability across the 549 papers.
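
One way to obtain the confidence intervals requested in the last minor comment is a percentile bootstrap over papers. The sketch below is an assumption about how that could be done, not something the manuscript reports; `bootstrap_ci`, its defaults (2000 resamples, 95% interval), and the synthetic data are all hypothetical.

```python
import random


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # rho is undefined when a resample is constant
        stats.append(spearman_rho(xs, ys))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Synthetic, nearly monotone data standing in for (memory score, citations).
x = list(range(30))
y = [2 * v + (v % 3) for v in x]
lower, upper = bootstrap_ci(x, y)
```

Resampling whole papers keeps each memory score paired with its own citation count, which is what makes the interval reflect variability across the paper sample.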

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us identify areas to strengthen the manuscript. We address each major comment below and indicate the revisions we will make in the next version.

  1. Referee: [Results] Results section: the central claim that probe accuracy reflects differential training-data exposure is load-bearing, yet no regression controls or matching for paper-level confounders (length, abstract complexity, topic popularity, or stylistic clarity) are reported; the observed rho=0.1495 (and rho=0.1880 for 2024 papers) could arise from these factors rather than memory of specific paper text.

    Authors: We agree that unmeasured paper-level factors could contribute to the observed correlations and that explicit controls would strengthen the central claim. In the revised manuscript we will add ordinary least-squares regressions of probe accuracy on citation count while controlling for paper length (token count), abstract complexity (Flesch-Kincaid score and sentence length), and topic popularity (one-hot encoding of primary arXiv category plus venue). We will also report partial Spearman correlations after residualizing out these covariates. These additions directly test whether the signal survives after accounting for the listed confounders. revision: yes

  2. Referee: [Experimental setup] Experimental setup: validation relies on correlation with citation counts—the very metric the approach seeks to supplement—creating circularity risk; while the 2024 control and author-probe strength provide partial independent grounding, the manuscript lacks baseline comparisons (e.g., against random probes, length-matched controls, or non-LLM heuristics) or error analysis to isolate the proposed mechanism.

    Authors: We acknowledge the value of additional baselines. We will insert a new subsection that reports (i) probe accuracy against random-choice baselines (uniform and frequency-matched), (ii) results on length-matched control texts drawn from non-academic sources, and (iii) a brief error analysis of the 50 papers with largest residuals. The 2024-paper subset and the superior performance of author probes already provide some separation from citation counts; the new baselines will further isolate the contribution of parametric memory. Full non-LLM heuristics (e.g., simple n-gram overlap with Google Scholar snippets) lie outside the current scope but will be noted as future work. revision: partial

  3. Referee: [Discussion] Discussion: the non-monotonic scale result (e.g., Llama-3.2-3B outperforming larger models) is interesting but does not distinguish exposure-driven memory from topic-salience effects, as the skeptic concern notes; a concrete test such as topic-controlled subsets or human-rated salience covariates is needed to support the selective-memory interpretation.

    Authors: We concur that topic salience remains a plausible alternative explanation. In revision we will add a stratified analysis that recomputes the scale–performance relationship within each of the five most frequent arXiv categories (cs.AI, cs.LG, cs.CL, cs.CV, cs.SE), thereby holding broad topic constant. We will also expand the Discussion to explicitly acknowledge that human-rated salience covariates would be a stronger test and to frame the non-monotonic finding as suggestive rather than conclusive evidence for selective memory. These steps address the concern within the limits of the existing dataset. revision: partial
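
The partial Spearman correlation promised in the first response can be sketched for the simplest case of a single covariate. With one covariate, residualizing the ranks of both variables on the ranks of the covariate is equivalent to the first-order partial correlation formula applied to the three pairwise rank correlations. The code below is only an illustration under that assumption: the authors propose several covariates at once, which would need a multiple regression on ranks instead, and the toy values are invented.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def partial_spearman(x, y, z):
    """Rank correlation of x and y after removing the rank-linear effect of z."""
    r_xy = spearman_rho(x, y)
    r_xz = spearman_rho(x, z)
    r_yz = spearman_rho(y, z)
    denom = ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
    if denom == 0:
        raise ValueError("covariate perfectly rank-correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


# Invented toy values: memory score, citations, and paper length as the covariate.
memory = [2, 1, 4, 3, 6, 5, 8, 7]
cites = [1, 2, 4, 3, 5, 6, 8, 7]
length = [1, 2, 3, 4, 5, 6, 7, 8]

raw = spearman_rho(memory, cites)
adjusted = partial_spearman(memory, cites, length)
```

In this toy example the adjusted value is lower than the raw one because part of the raw association runs through the shared covariate; whether the paper's signal survives such an adjustment is exactly what the referee asks the authors to show.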

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper hypothesizes that greater exposure leads to stronger LLM parametric memory, designs title/author/method/venue probes, computes accuracies on 549 papers, and reports empirical Spearman correlations (rho=0.1495 overall, rho=0.1880 for 2024 papers) with citation counts as supporting evidence. This correlation is an external benchmark test, not a definitional reduction, fitted parameter renamed as prediction, or self-citation chain. The 2024 control (near-zero citations at training time) and non-monotonic scale finding supply independent observations that do not reduce to the input data by construction. No load-bearing step quotes or exhibits equivalence to prior author results or ansatzes. The paper remains self-contained against the citation benchmark it uses for validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one domain assumption about exposure-driven memory formation; no free parameters are explicitly fitted to produce the metric, and no new physical or theoretical entities are introduced.

axioms (1)
  • domain assumption: LLM parametric memory strength for a paper is driven primarily by the volume of that paper's content appearing in the model's training data due to academic exposure.
    Invoked to link probe performance to research impact rather than to other textual or training artifacts.

pith-pipeline@v0.9.0 · 5861 in / 1397 out tokens · 56879 ms · 2026-05-22T05:57:17.361579+00:00 · methodology

