From Node2Vec to GPT-based GraphRAG: scientific impact prediction across graph and language models
Pith reviewed 2026-05-19 23:29 UTC · model grok-4.3
The pith
Directed citation graphs combined with textual embeddings predict scientific impact with 0.84-0.85 AUC, while GPT prompts without retrieval often match GraphRAG performance at 0.87.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formulate impact prediction as classifying papers into cohort-normalized top-P% citation ranks and show that supervised models using Node2Vec on directed citation graphs plus textual embeddings reach about 0.84-0.85 AUC. GPT-based GraphRAG using graph neighborhoods as context achieves up to 0.87 AUC, but target-only prompts perform as well or better. This indicates that structural and textual signals complement each other in supervised settings, while retrieval augmentation needs careful comparison to simple baselines.
What carries the argument
Temporally constrained citation and textual-similarity graphs processed with Node2Vec embeddings, fused with OpenAI text embeddings for supervised classification, alongside GPT models prompted with or without graph neighborhood context.
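The textual-similarity side of this machinery can be pictured with a minimal sketch. The excerpts further down describe connecting each paper to its top-K most similar papers; the code below does that with cosine similarity over text embeddings. The function names, the toy two-dimensional vectors, and the rule that a paper may only link to papers published in the same year or earlier are assumptions made for illustration, not details confirmed by the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_similarity_edges(embeddings, years, k):
    """Return directed, weighted edges (source, target, similarity).

    Assumption: a paper may only link to papers published in the same
    year or earlier, as a stand-in for the temporal constraint.
    """
    edges = []
    for src, vec in embeddings.items():
        candidates = [
            (cosine(vec, embeddings[dst]), dst)
            for dst in embeddings
            if dst != src and years[dst] <= years[src]
        ]
        # Highest similarity first; ties broken by paper id for determinism.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        edges.extend((src, dst, sim) for sim, dst in candidates[:k])
    return edges

# Toy data (hypothetical, not from the paper).
embeddings = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.0, 1.0],
    "p4": [0.6, 0.8],
}
years = {"p1": 2010, "p2": 2011, "p3": 2011, "p4": 2012}
edges = top_k_similarity_edges(embeddings, years, k=2)
```

Dropping the weight from each edge gives the unweighted variant, and adding the reverse of every edge gives an undirected one, which is one plausible reading of the direction and weighting variants the excerpts mention.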
Load-bearing premise
Cohort-normalized top-P% citation rank years after publication acts as a stable, unbiased proxy for scientific impact that can be predicted from publication-time data without future leakage.
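To make the premise concrete, here is a minimal sketch of how a cohort-normalized top-P% label could be computed. It assumes that a cohort is a publication year, that citation counts are measured at a fixed horizon, and that papers tied with the last top-ranked paper are also labeled positive. The function name `top_p_labels` and the toy numbers are hypothetical.

```python
import math
from collections import defaultdict

def top_p_labels(papers, p):
    """papers: list of (paper_id, cohort_year, citations).

    Returns {paper_id: 1 or 0}, where 1 means the paper is in the
    top p percent of its own cohort by citation count.
    """
    cohorts = defaultdict(list)
    for pid, year, cites in papers:
        cohorts[year].append((pid, cites))

    labels = {}
    for members in cohorts.values():
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        n_top = max(1, math.ceil(len(ranked) * p / 100))
        # Assumption: ties at the cut-off are counted as positive.
        threshold = ranked[n_top - 1][1]
        for pid, cites in members:
            labels[pid] = 1 if cites >= threshold else 0
    return labels

# Toy data (hypothetical): two cohorts with very different citation scales.
papers = [
    ("a", 2010, 50), ("b", 2010, 10), ("c", 2010, 5), ("d", 2010, 1), ("e", 2010, 0),
    ("f", 2015, 3), ("g", 2015, 30), ("h", 2015, 2), ("i", 2015, 1), ("j", 2015, 0),
]
labels = top_p_labels(papers, p=20)
```

Because ranking happens inside each cohort, a paper is compared only with papers of the same age, which is what keeps older cohorts from dominating the positive class simply by having had more time to accumulate citations.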
What would settle it
Model AUC dropping to near 0.5 or below when applied to predict impact in a completely new scientific field or later publication year cohort using the same training setup.
Original abstract
Identifying which newly published scientific papers are likely to become highly cited is important for prioritizing research attention, supporting editorial decisions, and guiding the allocation of scientific resources, particularly under cold-start conditions where little direct evidence is available at publication time. In this work, we formulate impact prediction as a cohort-normalized top-P% classification task and compare graph-based and LLM-based approaches under a unified framework. We construct citation and textual-similarity graphs under temporal constraints and generate Node2Vec representations, either alone or combined with OpenAI text embeddings. The best supervised configuration combines directed citation graphs with textual embeddings, reaching approximately 0.84-0.85 AUC. We also evaluate a GPT-based GraphRAG setup, using GPT 5.5 and 5.4 Nano, in which graph neighborhoods are used as contextual evidence for prediction. Although the LLM-based approach achieves high performance, retrieved context does not consistently improve results; target-only prompts often perform as well as or better than GraphRAG prompts achieving the 0.87 mark. These findings indicate that structural and textual signals are complementary for supervised prediction, while retrieval augmentation must be carefully evaluated against simpler LLM baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formulates scientific impact prediction as a cohort-normalized top-P% citation classification task and compares Node2Vec embeddings derived from temporally constrained citation and textual-similarity graphs (alone or combined with OpenAI text embeddings) against GPT-based GraphRAG prompting using the GPT variants named in the abstract (GPT 5.5 and 5.4 Nano). It reports that the strongest supervised configuration reaches approximately 0.84-0.85 AUC while LLM prompting achieves up to 0.87 AUC, with the observation that target-only prompts frequently match or exceed GraphRAG performance.
Significance. If the reported performance differences hold under rigorous validation, the work usefully demonstrates complementarity between graph structure and textual features for cold-start prediction and supplies a practical reminder that retrieval augmentation must be benchmarked against simpler LLM baselines. The emphasis on temporal graph construction to respect publication-time information is a positive methodological choice.
major comments (3)
- [Abstract] Abstract: The central AUC figures (0.84-0.85 supervised, 0.87 LLM) are presented without any description of dataset size, the concrete value chosen for P, the exact temporal train/test split dates, cross-validation procedure, or statistical significance testing of differences across configurations. These omissions are load-bearing for assessing whether the headline performance claims are reliable.
- [Abstract] Abstract: The cohort-normalized top-P% future citation rank is adopted as the prediction target with no explicit verification that post-publication information (e.g., journal effects, topic popularity shifts, or early visibility signals) does not leak into the label or the temporally constrained node features/neighborhoods. A concrete test—such as reporting AUC when using only pre-publication metadata or stratifying results by field—would strengthen the interpretation that the models are predicting intrinsic impact rather than recovering early cues.
- [Abstract] Abstract: The statement that 'retrieved context does not consistently improve results' requires supporting quantitative evidence; a table or figure comparing AUC (with confidence intervals) for target-only versus GraphRAG prompts across all model variants would make this claim verifiable rather than qualitative.
minor comments (1)
- [Abstract] Abstract: The phrasing 'GPT 5.5 and 5.4 Nano' is unclear and likely contains a typographical error; the exact model identifiers (e.g., GPT-3.5-turbo and GPT-4o) should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for improving transparency and rigor. We address each major comment point by point below and have incorporated revisions to strengthen the manuscript.
-
Referee: [Abstract] Abstract: The central AUC figures (0.84-0.85 supervised, 0.87 LLM) are presented without any description of dataset size, the concrete value chosen for P, the exact temporal train/test split dates, cross-validation procedure, or statistical significance testing of differences across configurations. These omissions are load-bearing for assessing whether the headline performance claims are reliable.
Authors: We agree that the abstract would benefit from these key details to allow readers to properly evaluate the claims. The full manuscript provides this information in the Methods and Experimental Setup sections, including the scale of the paper collection, the specific P threshold for the top-P% task, the exact temporal cutoffs for train/test splits, the cross-validation procedure used, and statistical tests comparing AUC differences. We will revise the abstract to concisely include summaries of these elements. revision: yes
-
Referee: [Abstract] Abstract: The cohort-normalized top-P% future citation rank is adopted as the prediction target with no explicit verification that post-publication information (e.g., journal effects, topic popularity shifts, or early visibility signals) does not leak into the label or the temporally constrained node features/neighborhoods. A concrete test—such as reporting AUC when using only pre-publication metadata or stratifying results by field—would strengthen the interpretation that the models are predicting intrinsic impact rather than recovering early cues.
Authors: This concern is well-taken. The temporal constraints on graph construction and feature extraction are explicitly designed to use only information available at publication time, thereby avoiding post-publication leakage. To further bolster the interpretation, we will add new analyses that report AUC using only pre-publication metadata and that stratify performance by scientific field. These results will be presented in a dedicated subsection of the revised manuscript. revision: yes
-
Referee: [Abstract] Abstract: The statement that 'retrieved context does not consistently improve results' requires supporting quantitative evidence; a table or figure comparing AUC (with confidence intervals) for target-only versus GraphRAG prompts across all model variants would make this claim verifiable rather than qualitative.
Authors: We agree that the claim requires explicit quantitative backing to be fully verifiable. While detailed per-variant comparisons appear in the Results section, we will add a consolidated table (or expand an existing results table) that directly reports AUC values together with confidence intervals for target-only versus GraphRAG prompts across all GPT variants. This will make the observation that retrieved neighborhoods do not consistently outperform target-only baselines immediately evident and quantifiable. revision: yes
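The kind of comparison requested here, AUC with confidence intervals for target-only versus GraphRAG prompts, can be sketched with a paired bootstrap. This is an illustrative approach under stated assumptions, not the procedure the authors used: the function names, the 95% percentile interval, and the toy scores are all hypothetical.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """95% percentile interval for AUC(a) - AUC(b), resampling the same papers for both."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one of the two classes
        a = auc(y, [scores_a[i] for i in idx])
        b = auc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    return diffs[round(0.025 * n_boot)], diffs[round(0.975 * n_boot) - 1]

# Toy scores (hypothetical, not results from the paper).
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
target_only = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.5, 0.8, 0.1, 0.35]
graphrag = [0.8, 0.75, 0.4, 0.45, 0.3, 0.25, 0.5, 0.7, 0.2, 0.3]
low, high = paired_bootstrap_auc_diff(labels, target_only, graphrag, n_boot=200)
```

An interval that contains zero would be consistent with the claim that retrieved context does not reliably help. Resampling the same papers for both prompt styles keeps the comparison paired, which usually gives a tighter interval than bootstrapping each AUC independently.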
Circularity Check
No circularity: empirical ML evaluation on temporal held-out data
The paper trains Node2Vec embeddings and GPT-based classifiers on temporally constrained citation and similarity graphs to predict cohort-normalized future citation rank. AUC values are measured on held-out test sets after temporal splits, not obtained by fitting a parameter that directly encodes the target label or by renaming an input. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation. The approach is a standard supervised comparison whose reported results are not fixed by its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- top-P% threshold
axioms (1)
- domain assumption Citation and textual-similarity graphs can be constructed under strict temporal constraints that simulate cold-start conditions at publication time.
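A minimal sketch of what such a temporal constraint could look like for the citation graph follows. It assumes year-level granularity and that a citation edge becomes visible once the citing paper is published. The helper name `citation_snapshot` and the toy data are hypothetical, and the paper's actual windowing may differ.

```python
def citation_snapshot(citations, pub_year, cutoff_year):
    """Keep only citation edges that already exist at cutoff_year.

    citations: iterable of (citing_id, cited_id)
    pub_year:  {paper_id: publication year}
    """
    return [
        (src, dst)
        for src, dst in citations
        # Both endpoints must be published by the cut-off, so no edge
        # created by a later paper can leak into the snapshot.
        if pub_year[src] <= cutoff_year and pub_year[dst] <= cutoff_year
    ]

# Toy data (hypothetical).
pub_year = {"a": 2009, "b": 2012, "c": 2015, "d": 2017}
citations = [("b", "a"), ("c", "a"), ("c", "b"), ("d", "c"), ("d", "a")]

snapshot_2015 = citation_snapshot(citations, pub_year, cutoff_year=2015)
snapshot_2009 = citation_snapshot(citations, pub_year, cutoff_year=2009)
```

In the 2015 snapshot, paper "c" has outgoing edges to its references but no incoming citations yet, which is the cold-start situation the assumption refers to.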
Reference graph
Works this paper leans on
- [1] Data processing. The first part of this work started by selecting 35,354 academic papers from the journal ACS Applied Materials & Interfaces (American Chemical Society, ACS), spanning 11 publication years from 2009 to 2020 and exhibiting a natural year-over-year growth in volume. The number of papers published yearly is shown in Figure 2. The only informati...
- [2] Graph Construction. Once the database was organized into Y-year post-publication windows and the target metric labels were defined, we used the data available up to that point to construct two graphs representing complementary relational views: citations and semantic similarity. For both graph types we created four variations based on (i) edge direction (di...
- [3] Embedding construction. After the graph construction phase, we have the following graph families: (i) the citation graph built from citation relations, and (ii) the similarity graph built by connecting top-K most similar papers, producing four similarity-graph variants according to K ∈ {3, 5, 7, 9}. Each graph family is further expanded by two edge direction ...
- [4] Impact classification. In the final phase, we use the embeddings from the previous phase as inputs to a supervised classification model that predicts whether each paper will be a “top paper” under a given definition, percentile threshold P ∈ {10, 20, 30, 40, 50}, and prediction horizon Y ∈ {0, ..., 10} (when observable). Concretely, each training instance co...
- [5] Graph construction. The LLM-based experiments use the graphs constructed in the previous stage; consequently, no additional graph-construction procedures are introduced here. We retain the ... (FIG. 3. Overview of the LLM-based GraphRAG methodology for top-paper prediction. The process reuses the citation and textual-similarity graphs constructed in the ...)
- [6] Context retrieval. For each sampled target paper, we extracted a local neighborhood from the graph to serve as contextual information within the prompt. We evaluated two distinct retrieval strategies: (i) random sampling from the target node’s immediate graph neighbors, and (ii) similarity-based selection, where we identified the top five most similar pap...
- [7] Prompt construction. We employed GPT-5.5 and GPT 5.4 Nano as the underlying models, configuring the prompting protocol to function as a specialized scientific impact prediction engine rather than a general-purpose assistant. To achieve this, each request was structured into three distinct layers: a system prompt, a developer prompt, and a programmatically ...
- [8] Prediction and evaluation. For each target paper, the LLM produces a single structured response centered on a probability vector for top-paper prediction across all requested horizon years, together with additional auxiliary outputs included for completeness. Thus, unlike the graph-based classifier, which solves separate binary classification problems...
- [9] Comparison between citation and textual-similarity graphs. Figure 4 presents the AUC scores for the classification of papers belonging to the top 20% of their respective cohorts for each year following publication. We evaluate two distinct graph construction strategies (Paper Citation and Textual Similarity) against two input representations for the neural c...
- [10] Sensitivity analysis of Top-K textual similarity graphs. In Figure 5, we examine the textual-similarity graphs by varying the number of neighbors (K) from 3 to 9. The most prominent finding is that concatenating textual embeddings with Node2Vec consistently outperforms the standalone Node2Vec model across all values of K. This confirms that explicitly prese...
- [11] Effect of quantile-based thresholds on top-paper prediction performance. In Figure 6, we extend the experiment across various quantile thresholds (50th to 90th percentiles in increments of 10) used to define the positive class. While the figure displays results exclusively for directed and weighted graphs, the observed trends were consistent across other g...
- [12] Effect of neighbor retrieval strategy in GraphRAG-based prediction. We employed the graph structure as a retrieval mechanism, selecting neighbors of the target paper and injecting them into the LLM prompt as contextual evidence. We evaluated multiple graph configurations by varying edge direction and edge weighting, and compared two neighborhood-selection ...
- [13] Effect of graph-retrieved context on LLM prediction performance. In Figure 8, we replicate the GraphRAG setup under a simplified and controlled configuration, using directed and unweighted graphs with random retrieval, and compare prompts with and without graph-retrieved neighbors. The goal of this experiment is to isolate the contribution of retrieved grap...
- [14] Cross-journal evaluation with and without graph-retrieved context. To assess whether the behavior observed in the main corpus was specific to a single journal or reflected a more general property of the LLM-based prediction setup, we repeated the context-ablation experiment on three additional journals: Informetrics, PNAS, and PRL. We kept the same controlle...
- [15]
- [16] D. W. Aksnes, L. Langfeldt, and P. Wouters. Citations, citation indicators, and research quality: An overview of basic concepts and theories. Sage Open, 9(1):2158244019829575, 2019.
- [17] D. R. Amancio, M. d. G. V. Nunes, O. N. Oliveira Jr, and L. da F. Costa. Using complex networks concepts to assess approaches for citations in scientific papers. Scientometrics, 91(3):827–842, 2012.
- [18] T. Azad, I. Al Azher, S. R. Choudhury, and H. Alhoori. Predicting the scholarly impact of research papers using retrieval-augmented llms. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 124–131, 2025.
- [19] I. Beltagy, K. Lo, and A. Cohan. SciBERT: A pretrained language model for scientific text. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China, Nov. ...
- [20] A. C. M. Brito, F. N. Silva, and D. R. Amancio. A complex network approach to political analysis: Application to the brazilian chamber of deputies. PLOS ONE, 15(3):e0229928, 2020.
- [21] A. Cohan, S. Feldman, I. Beltagy, D. Downey, and D. Weld. SPECTER: Document-level representation learning using citation-informed transformers. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online, July 2020. Association for Comp...
- [22] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
- [23] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. CoRR, abs/1607.00653, 2016.
- [24] G. He, Z. Xue, Z. Jiang, Y. Kang, S. Zhao, and W. Lu. H2cgl: Modeling dynamics of citation network for impact prediction. Information Processing & Management, 60(6):103512, 2023.
- [25] L. He, L. Bai, X. Yang, H. Du, and J. Liang. High-order graph attention network. Information Sciences, 630:222–234, 2023. ISSN 0020-0255.
- [26]
- [27] K. Kousha and M. Thelwall. Factors associating with or predicting more cited or higher quality journal articles: An annual review of information science and technology (arist) paper. Journal of the Association for Information Science and Technology, 75(3):215–244, 2024.
- [28] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013.
- [29] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
- [30] A. P. Millán, H. Sun, L. Giambagli, R. Muolo, T. Carletti, J. J. Torres, F. Radicchi, J. Kurths, and G. Bianconi. Topology shapes dynamics of higher-order networks. Nature Physics, 21(3):353–361, 2025.
- [31] OpenAI. New embedding models and api updates. https://openai.com/index/new-embedding-models-and-api-updates/, Jan. 2024. Accessed: 2026-01-16.
- [32]
- [33] A. M. Petersen, R. K. Pan, F. Pammolli, and S. Fortunato. Methods to account for citation inflation in research evaluation. Research Policy, 48(7):1855–1865, 2019.
- [34] C. Stegehuis, N. Litvak, and L. Waltman. Predicting the long-term citation impact of recent publications. Journal of Informetrics, 9(3):642–657, 2015. ISSN 1751-1577.
- [35]
- [36] A. Vital and D. R. Amancio. A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks. Scientometrics, 127(10):6011–6028, 2022.
- [37] A. Vital, Jr., F. N. Silva, and D. R. Amancio. Comparing random walks in graph embedding and link prediction. PLOS ONE, 19(11):1–22, 11 2024.
- [38] A. Vital Jr, F. N. Silva, and D. R. Amancio. Recovering link-weight structure in complex networks with weight-aware random walks. arXiv preprint arXiv:2508.07489, 2025.
- [39] A. Vital Jr, F. N. Silva, O. N. Oliveira Jr, and D. R. Amancio. Predicting citation impact of research papers using gpt and other text embeddings. Physica A: Statistical Mechanics and its Applications, 674:130789, 2025.
- [40] L. Waltman. A review of the literature on citation impact indicators. Journal of informetrics, 10(2):365–391, 2016.
- [41] L. Waltman and M. Schreiber. On the calculation of percentile-based bibliometric indicators. Journal of the American Society for information Science and Technology, 64(2):372–379, 2013.
- [42] D. Wang, C. Song, and A.-L. Barabási. Quantifying long-term scientific impact. Science, 342(6154):127–132, 2013.
- [43] X. Wu, H. Pang, Y. Fan, Y. Linghu, and Y. Luo. Probwalk: A random walk approach in weighted graph embedding. Procedia Computer Science, 183:683–689, 2021.
- [44]
- [45] Q. Zhao and X. Feng. Utilizing citation network structure to predict paper citation counts: A deep learning approach. Journal of Informetrics, 16(1):101235, 2022.
- [46] X. Zhou, J. Wang, J. Wang, and Q. Guan. Predicting air quality using a multi-scale spatiotemporal graph attention network. Information Sciences, 680:121072, 2024. ISSN 0020-0255.
Appendix A: Prompt Templates Used in the LLM-Based Experiments
This appendix presents the prompt templates used in the LLM-based prediction experiments. The prompting proto...
- [47] System Prompt. You are a scientific impact prediction engine for journal articles. Your job is to estimate calibrated probabilities for whether a target paper will become a top paper within its journal at each requested horizon year. Output rules:
- [48] Output valid JSON only. No Markdown, no explanation, and no extra text.
- [49]
- [50]
- [51] Probabilities must be numbers, not strings.
- [52]
- [53] Do not reveal reasoning or chain-of-thought. Return only the final JSON.
- [54] Use only the information explicitly present in the XML input.
- [55] Do not use external facts or hidden assumptions about papers, authors, journals, venues, identifiers, files, or datasets.
- [56] Developer Prompt. Task: Predict, for the target journal article, the probability that it will be a top paper by accumulated citations at each requested horizon year. Positive event: The positive event is defined by <CONFIG><q_value>: • q_value is a quantile threshold. • A paper is considered top if it belongs to the top (1 − q_value) fraction within its...
- [57] User Prompt Template. The user prompt was generated dynamically from the experiment payload and serialized in XML format. It always contained a <REQUEST> root block. The <OUTPUT_SPEC> section specified the required JSON schema and the exact output vector length. The <CONFIG> section described the graph and retrieval settings. The <TARGET> section contained the me...
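The prompt excerpts above describe an XML request with a `<REQUEST>` root and a JSON-only response whose probabilities must be numbers. A rough sketch of both sides might look like the following. Only the `REQUEST`, `OUTPUT_SPEC`, `CONFIG`, `TARGET`, and `q_value` names come from the excerpts; the child tags `vector_length`, `title`, and `abstract`, the JSON key `probabilities`, and both function names are hypothetical placeholders, since the actual schema is truncated in the source.

```python
import json
import xml.etree.ElementTree as ET

def build_request(title, abstract, q_value, horizons):
    """Serialize one target paper into an XML request string."""
    root = ET.Element("REQUEST")
    spec = ET.SubElement(root, "OUTPUT_SPEC")
    # Hypothetical tag: tells the model how long the output vector must be.
    ET.SubElement(spec, "vector_length").text = str(len(horizons))
    config = ET.SubElement(root, "CONFIG")
    ET.SubElement(config, "q_value").text = str(q_value)
    target = ET.SubElement(root, "TARGET")
    ET.SubElement(target, "title").text = title        # hypothetical tag
    ET.SubElement(target, "abstract").text = abstract  # hypothetical tag
    return ET.tostring(root, encoding="unicode")

def parse_response(raw, expected_length):
    """Validate a JSON-only response and return the probability vector."""
    data = json.loads(raw)
    probs = data["probabilities"]  # hypothetical key name
    if len(probs) != expected_length:
        raise ValueError("unexpected output vector length")
    for p in probs:
        # Reject strings and booleans: probabilities must be real numbers.
        if isinstance(p, bool) or not isinstance(p, (int, float)):
            raise ValueError("probabilities must be numbers, not strings")
        if not 0.0 <= p <= 1.0:
            raise ValueError("probability outside [0, 1]")
    return [float(p) for p in probs]

# q_value = 0.8 corresponds to the top (1 - 0.8) = 20% of papers.
xml_payload = build_request("Example title", "Example abstract", 0.8, horizons=[0, 1, 2])
probs = parse_response('{"probabilities": [0.1, 0.25, 0.4]}', expected_length=3)
```

Validating the length and types of the returned vector before scoring keeps malformed responses from silently entering the AUC computation.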