Impacts of Data Splitting Strategies on Parameterized Link Prediction Algorithms
Pith reviewed 2026-05-18 00:31 UTC · model grok-4.3
The pith
Using the test set during hyperparameter tuning inflates link prediction performance by an average 3.6%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that the common practice of using the test set during hyperparameter tuning in link prediction leads to human-induced information leakage, resulting in an average overestimation of model performance by about 3.6% across 60 networks, with the bias exceeding 15% for specific algorithms. The Loss Ratio metric is proposed to quantify this overestimation, and the analysis reveals that heuristic and random-walk-based methods are more robust and stable.
What carries the argument
Loss Ratio, a metric that quantifies the extent of performance overestimation caused by information leakage from using the test set in hyperparameter tuning.
If this is right
- Standardized data splitting strategies are required for fair and reproducible benchmarking of link prediction models.
- Parameterized models should be evaluated without access to test data during tuning to avoid inflated results.
- Heuristic and random-walk-based methods provide more reliable performance estimates under proper evaluation protocols.
- The overestimation issue is widespread across networks in multiple domains.
Where Pith is reading between the lines
- Many existing published results on link prediction may have overstated performance and could benefit from re-assessment using the Loss Ratio.
- This evaluation issue likely extends to other network analysis tasks involving parameterized models, such as community detection or graph classification.
- Researchers should develop automated tools to detect and correct for such leakage in benchmarking pipelines.
Load-bearing premise
The Loss Ratio metric accurately and specifically isolates the performance overestimation caused by using the test set during hyperparameter tuning.
What would settle it
Re-running the experiments on the same 60 networks but with hyperparameters tuned strictly on training and validation data only, then checking whether the observed performance drop matches the Loss Ratio values reported for each algorithm.
Figures
read the original abstract
Link prediction is a fundamental problem in network science, aiming to infer potential or missing links based on observed network structures. With the increasing adoption of parameterized models, the rigor of evaluation protocols has become critically important. However, a previously common practice of using the test set during hyperparameter tuning has led to human-induced information leakage, thereby inflating the reported model performance. To address this issue, this study introduces a novel evaluation metric, Loss Ratio, which quantitatively measures the extent of performance overestimation. We conduct large-scale experiments on 60 real-world networks across six domains. The results demonstrate that the information leakage leads to an average overestimation of about 3.6%, with the bias reaching over 15% for specific algorithms. Meanwhile, heuristic and random-walk-based methods exhibit greater robustness and stability. The analysis uncovers a pervasive information leakage issue in link prediction evaluation and underscores the necessity of adopting standardized data splitting strategies to enable fair and reproducible benchmarking of link prediction models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that using the test set during hyperparameter tuning for parameterized link prediction models introduces information leakage that inflates reported performance. It introduces a new metric called Loss Ratio to quantify the resulting overestimation and reports results from experiments on 60 real-world networks across six domains, finding an average overestimation of about 3.6% (with bias exceeding 15% for some algorithms). Heuristic and random-walk methods are described as more robust, and the work recommends standardized data splitting strategies for fair benchmarking.
Significance. If the central empirical findings hold after clarification of controls, the paper would usefully document a methodological issue that affects reproducibility in link prediction research. The scale of the study (60 networks) supplies broad evidence that the problem is pervasive rather than isolated. The proposal of Loss Ratio as a diagnostic offers a concrete, if preliminary, tool for quantifying evaluation bias in graph-based models.
major comments (3)
- [§3.2] §3.2 (Loss Ratio definition): the metric is presented as a normalized performance gap between tuning regimes that include versus exclude test edges, but the text does not specify whether hyperparameter search budget, number of trials, early-stopping criteria, or random seeds are held identical across regimes; without these controls the observed gap cannot be attributed solely to information leakage rather than differences in model selection effort.
- [Results section] Results section and associated tables (e.g., aggregate statistics over 60 networks): the reported average 3.6% overestimation and per-algorithm maxima above 15% are given without error bars, confidence intervals, or explicit exclusion criteria for networks, preventing assessment of whether the quantitative claims are statistically robust or sensitive to a few outliers.
- [§5.3] §5.3 (robustness claims): the statement that heuristic and random-walk methods exhibit greater stability is not accompanied by a controlled comparison (e.g., fixed search space size or parameter count) that would rule out the alternative explanation that these methods simply require less tuning and therefore show smaller apparent gaps.
minor comments (2)
- [Abstract] Abstract: the six domains of the 60 networks are mentioned but not enumerated; adding the domain list would improve context for readers.
- [Figures] Figure captions: several figures comparing splitting strategies would benefit from explicit labels indicating which curves correspond to 'test-included' versus 'test-excluded' tuning.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We have addressed each major comment point by point below, making revisions to improve clarity, statistical presentation, and controlled comparisons where appropriate.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Loss Ratio definition): the metric is presented as a normalized performance gap between tuning regimes that include versus exclude test edges, but the text does not specify whether hyperparameter search budget, number of trials, early-stopping criteria, or random seeds are held identical across regimes; without these controls the observed gap cannot be attributed solely to information leakage rather than differences in model selection effort.
Authors: We agree that explicit controls are essential to attribute the gap to information leakage. In the revised manuscript, §3.2 now states that both regimes used an identical search budget of 100 trials, the same early-stopping rule based on validation performance, and fixed random seeds. This ensures the observed difference arises from test-edge inclusion rather than unequal optimization effort. We have added a brief description of the protocol and pseudocode in the appendix. revision: yes
-
Referee: [Results section] Results section and associated tables (e.g., aggregate statistics over 60 networks): the reported average 3.6% overestimation and per-algorithm maxima above 15% are given without error bars, confidence intervals, or explicit exclusion criteria for networks, preventing assessment of whether the quantitative claims are statistically robust or sensitive to a few outliers.
Authors: We acknowledge the value of statistical robustness indicators. The revised Results section now reports the mean overestimation with standard deviation (4.2%) and 95% confidence intervals across all 60 networks. We explicitly note that no networks were excluded and include a short sensitivity check confirming the average remains stable (3.5%) after removing the top 5% of outliers. revision: yes
-
Referee: [§5.3] §5.3 (robustness claims): the statement that heuristic and random-walk methods exhibit greater stability is not accompanied by a controlled comparison (e.g., fixed search space size or parameter count) that would rule out the alternative explanation that these methods simply require less tuning and therefore show smaller apparent gaps.
Authors: This point is well taken. We have added a controlled experiment in the revised §5.3 that normalizes the hyperparameter search to an equal number of configurations (50 trials) for every method. Under this equalized tuning budget, heuristic and random-walk methods continue to exhibit smaller gaps than parameterized models. A new supplementary table summarizes the controlled results. revision: yes
Circularity Check
Empirical evaluation with introduced metric exhibits no circularity
full rationale
The paper performs large-scale experiments across 60 real-world networks and introduces the Loss Ratio metric to quantify performance overestimation from test-set inclusion in hyperparameter tuning. No equations, derivations, or self-citations reduce the reported 3.6% average overestimation (or per-algorithm biases) to fitted parameters defined by the same experiments or to prior self-work by construction. Central claims rest on direct experimental comparisons between splitting strategies, rendering the analysis self-contained against external benchmarks with no load-bearing reductions to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard train-test splits for link prediction should keep the test set completely separate from hyperparameter tuning and model selection.
invented entities (1)
-
Loss Ratio
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Loss Ratio L = |AUC* − AUC'| / AUC* … nested partitioning with consistent test set … two-set vs three-set split
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
heuristic and random-walk-based methods exhibit greater robustness … deep learning … higher L
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
L. Lü, T. Zhou, Link prediction in complex networks: A survey, Physica A 390 (2011) 1150–1170. https://doi.org/10.1016/j.physa.2010.11.027
-
[2]
P. Wang, B. Xu, Y. Wu, X. Zhou, Link prediction in social networks: the state-of-the-art, Sci. China Inf. Sci. 58 (2015) 1–38. https://doi.org/10.1007/s11432-015-5399-y
-
[3]
V. Martínez, F. Berzal, J.C. Cubero, A survey of link prediction in complex networks, ACM Comput. Surv. 49 (2016) 69. https://doi.org/10.1145/3012704
-
[4]
A. Kumar, S.S. Singh, K. Singh, B. Biswas, Link prediction techniques, applications, and performance: a survey, Physica A 553 (2020) 124289. https://doi.org/10.1016/j.physa.2020.124289
-
[5]
A. Divakaran, A. Mohan, Temporal link prediction: a survey, New Gener. Comput. 38 (2020) 213–258. https://doi.org/10.1007/s00354-019-00070-4
-
[6]
Zhou, Progresses and challenges in link prediction, iScience 24 (2021) 103217
T. Zhou, Progresses and challenges in link prediction, iScience 24 (2021) 103217. https://doi.org/10.1016/j.isci.2021.103217
-
[7]
D. Arrar, N. Kamel, A. Lakhfif, A comprehensive survey of link prediction methods, J. Supercomput. 80 (2024) 3902–3942. https://doi.org/10.1007/s11227-023-05823-4
-
[8]
L.A. Adamic, E. Adar, Friends and neighbors on the Web, Soc. Netw. 25 (2003) 211–230. https://doi.org/10.1016/S0378-8733(03)00009-1
-
[9]
I.A. Kovács, K. Luck, K. Spirohn, Y. Wang, C. Pollis, S. Schlabach, W. Bian, D.-K. Kim, N. Kishore, T. Hao, M.A. Calderwood, M. Vidal, A.-L. Barabási, Network-based prediction of protein–protein interactions, Nat. Commun. 10 (2019) 1240. https://doi.org/10.1038/s41467-019-09177-y
-
[10]
L. Lü, M. Medo, C.H. Yeung, Y.-C. Zhang, Z.-K. Zhang, T. Zhou, Recommender systems, Phys. Rep. 519 (2012) 1–49. https://doi.org/10.1016/j.physrep.2012.02.006
-
[11]
A new status index derived from sociometric analysis
L. Katz, A new status index derived from sociometric analysis, Psychometrika 18 (1953) 39–43. https://doi.org/10.1007/BF02289026
-
[12]
E. A. Leicht, P. Holme, M. E. J. Newman, Vertex similarity in networks, Phys. Rev. E 73 (2006) 026120. https://doi.org/10.1103/PhysRevE.73.026120
-
[13]
The link-prediction problem for social networks
D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, Journal of the American SocietyforInformationScienceandTechnology58(2007)1019–1031.https://doi.org/10.1002/asi.20591. 11
-
[14]
F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens, Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation, IEEE Trans. Knowl. Data Eng. 19 (2007) 355–369. https://doi.org/10.1109/TKDE.2007.46
-
[15]
T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (2009) 623–630. https://doi.org/10.1140/epjb/e2009-00335-8
-
[16]
W. Liu, L. Lü, Link prediction based on local random walk, Europhys. Lett. 89 (2010) 58007. https://doi.org/10.1209/0295-5075/89/58007
-
[17]
C.V. Cannistraci, G. Alanis-Lobato, T. Ravasi, From link-prediction in brain connectomes and pro- tein interactomes to the local-community-paradigm in complex networks, Sci. Rep. 3 (2013) 1613. https://doi.org/10.1038/srep01613
-
[18]
A. Clauset, C. Moore, M.E.J. Newman, Hierarchical structure and the prediction of missing links, Nature 453 (2008) 98–101. https://doi.org/10.1038/nature06830
-
[19]
R. Guimerà, M. Sales-Pardo, Missing and spurious interactions and the reconstruction of complex networks, Proc. Natl. Acad. Sci. U.S.A. 106 (2009) 22073–22078
work page 2009
-
[20]
S. Soundarajan, J. Hopcroft, Using community information to improve the precision of link prediction methods, in: Proceedings of the 21st International Conference on World Wide Web, 2012, pp. 607–608. https://doi.org/10.1145/2187980.2188150
-
[21]
L. Lü, L. Pan, T. Zhou, Y. Zhang, H.E. Stanley, Toward link predictability of complex networks, Proc. Natl. Acad. Sci. U.S.A. 112 (2015) 2325–2330. https://doi.org/10.1073/pnas.1424644112
-
[22]
J. Kunegis, A. Lommatzsch, Learning spectral graph transformations for link prediction, in: Pro- ceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 561–568. https://dl.acm.org/doi/10.1145/1553374.1553447
- [23]
-
[24]
Deepwalk: Online learning of social representations,
B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 701–710. https://doi.org/10.1145/2623330.2623732
-
[25]
B. Yang, W.-T. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: Proceedings of the 3rd International Conference on Learning Representations, 2015
work page 2015
-
[26]
T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 2071– 2080
work page 2016
-
[27]
Node2vec: Scalable feature learning for networks,
A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864. https://doi.org/10.1145/2939672.2939754
-
[28]
T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proceed- ings of the 5th International Conference on Learning Representations, 2017
work page 2017
-
[29]
W.L. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, Adv. Neural Inf. Process. Syst. 30 (2017). https://dl.acm.org/doi/10.5555/3294771.3294869
-
[30]
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, in: Proceedings of the International Conference on Learning Representations, 2018. https://openreview.net/forum?id=rJXMpikCZ
work page 2018
-
[31]
M. Zhang, Y.-C. Chen, Link prediction based on graph neural networks, in: Advances in Neural Infor- mation Processing Systems, 2018, pp. 5171–5181. 12
work page 2018
-
[32]
K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Proceedings of the International Conference on Learning Representations, 2019. https://openreview.net/forum?id=ryGs6iA5Km
work page 2019
-
[33]
S.J. Ahn, M. Kim, Variational graph normalized autoencoders, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, 2021, pp. 2827–2836. https://doi.org/10.1145/3459637.3482149
-
[34]
B.P. Chamberlain, S. Shirobokov, E. Rossi, F. Frasca, T. Markovich, N.Y. Hammerla, M.M. Bronstein, M. Hansmire, Graph neural networks for link prediction with subgraph sketching, in: Proceedings of the Eleventh International Conference on Learning Representations, 2022. https://openreview.net/forum?id=m1oqEOAozQU
work page 2022
-
[35]
R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th international joint conference on Artificial intelligence, 1995, pp. 1137–1145
work page 1995
-
[36]
S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Stat. Surv. 4 (2010) 40–79. https://doi.org/10.1214/09-SS054
-
[37]
Leak- age in data mining: Formulation, detection, and avoidance
S. Kaufman, S. Rosset, C. Perlich, O. Stitelman, Leakage in data mining: Formula- tion, detection, and avoidance, ACM Trans. Knowl. Discov. Data 6 (2012) 1556–4681. https://doi.org/10.1145/2382577.2382579
-
[38]
T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algo- rithms, Neural Comput. 10 (1998) 1895–1923. https://doi.org/10.1162/089976698300017197
-
[39]
G.C. Cawley, N.L.C. Talbot, On over-fitting in model selection and subsequent selection bias in perfor- mance evaluation, J. Mach. Learn. Res. 11 (2010) 2079–2107
work page 2010
-
[40]
S. Varma, R. Simon, Bias in error estimation when using cross-validation for model selection, BMC Bioinform. 7 (2006) 91. https://doi.org/10.1186/1471-2105-7-91
-
[41]
G. Forman, M. Scholz, Apples-to-apples in cross-validation studies: Pitfalls in clas- sifier performance measurement, ACM SIGKDD Explor. Newsl. 12 (2010) 49–57. https://doi.org/10.1145/1882471.1882479
-
[42]
D.R. Roberts, V. Bahn, S. Ciuti, M.S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, C. Lahoz- Monfort, B. Schröder, W. Thuiller, D.I. Warton, B.A. Wintle, F. Hartig, C. Dormann, Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure, Ecography 40 (2017) 913–929. https://doi.org/10.1111/ecog.02881
-
[43]
A. Breit, S. Ott, A. Agibetov, M. Samwald, OpenBioLink: a benchmarking frame- work for large-scale biomedical link prediction, Bioinformatics 36 (2020) 4097–4098. https://doi.org/10.1093/bioinformatics/btaa274
-
[44]
G. Varoquaux, V. Cheplygina, Machine learning for medical imaging: methodological failures and recommendations for the future, NPJ Digit. Med. 5 (2022) 48. https://doi.org/10.1038/s41746-022- 00592-y
-
[45]
D.Sun, T.Zhou, J.G.Liu, R.Stanley, Y.-C.Zhang, Informationfilteringbasedontransferringsimilarity, Phys. Rev. E 80 (2009) 017101. https://doi.org/10.1103/PhysRevE.80.017101
-
[46]
T. Zhou, Discriminating abilities of threshold-free evaluation metrics in link prediction, Physica A 615 (2023) 128529. https://doi.org/10.1016/j.physa.2023.128529
-
[47]
X. Jiao, S. Wan, Q. Liu, Y. Bi, Y.-L. Lee, E. Xu, D. Hao, T. Zhou, Comparing discriminating abilities of evaluation metrics in link prediction, J. Phys. Complex. 5 (2024) 025014. https://doi.org/10.1088/2632- 072X/ad46be
-
[48]
Y. Bi, X. Jiao, Y.-L. Lee, T. Zhou, Inconsistency among evaluation metrics in link prediction, PNAS Nexus 3 (2024) pgae498. https://doi.org/10.1093/pnasnexus/pgae498. 13
-
[49]
S. Wan, Y. Bi, X. Jiao, T. Zhou, Quantifying discriminability of evaluation metrics in link prediction for real networks, Chaos Solitons Fractals 199 (2025) 116864. https://doi.org/10.1016/j.chaos.2025.116864
-
[50]
J. Kunegis, KONECT: the Koblenz network collection, in: Proceedings of the 22nd International Con- ference on World Wide Web, 2013, pp. 1343–1350. https://doi.org/10.1145/2487788.2488173
-
[51]
R.A. Rossi, N.K. Ahmed, The Network Data Repository with Interactive Graph Analytics and Visualization, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015. https://doi.org/10.1609/aaai.v29i1.9277. 14
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.