pith. sign in

arxiv: 2511.05834 · v2 · submitted 2025-11-08 · 📊 stat.OT

Impacts of Data Splitting Strategies on Parameterized Link Prediction Algorithms

Pith reviewed 2026-05-18 00:31 UTC · model grok-4.3

classification 📊 stat.OT
keywords link predictioninformation leakagedata splittinghyperparameter tuningevaluation metricnetwork scienceperformance overestimation
0
0 comments X

The pith

Using the test set during hyperparameter tuning inflates link prediction performance by an average 3.6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how data splitting strategies affect the evaluation of parameterized link prediction algorithms. It identifies that incorporating the test set into hyperparameter tuning creates information leakage that artificially boosts reported performance. To measure this effect, a new metric termed Loss Ratio is introduced. Experiments involving 60 real-world networks from six domains reveal an average performance overestimation of 3.6%, with some algorithms experiencing over 15% bias. Heuristic and random-walk methods demonstrate more resilience to this issue. Overall, the findings advocate for standardized data splitting to promote fair comparisons and reproducibility in the field.

Core claim

The authors show that the common practice of using the test set during hyperparameter tuning in link prediction leads to human-induced information leakage, resulting in an average overestimation of model performance by about 3.6% across 60 networks, with the bias exceeding 15% for specific algorithms. The Loss Ratio metric is proposed to quantify this overestimation, and the analysis reveals that heuristic and random-walk-based methods are more robust and stable.

What carries the argument

Loss Ratio, a metric that quantifies the extent of performance overestimation caused by information leakage from using the test set in hyperparameter tuning.

If this is right

  • Standardized data splitting strategies are required for fair and reproducible benchmarking of link prediction models.
  • Parameterized models should be evaluated without access to test data during tuning to avoid inflated results.
  • Heuristic and random-walk-based methods provide more reliable performance estimates under proper evaluation protocols.
  • The overestimation issue is widespread across networks in multiple domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing published results on link prediction may have overstated performance and could benefit from re-assessment using the Loss Ratio.
  • This evaluation issue likely extends to other network analysis tasks involving parameterized models, such as community detection or graph classification.
  • Researchers should develop automated tools to detect and correct for such leakage in benchmarking pipelines.

Load-bearing premise

The Loss Ratio metric accurately and specifically isolates the performance overestimation caused by using the test set during hyperparameter tuning.

What would settle it

Re-running the experiments on the same 60 networks but with hyperparameters tuned strictly on training and validation data only, then checking whether the observed performance drop matches the Loss Ratio values reported for each algorithm.

Figures

Figures reproduced from arXiv: 2511.05834 by Tao Zhou, Xinshan Jiao, Yilin Bi, Yuxin Luo.

Figure 1
Figure 1. Figure 1: The two-set and three-set split diagram. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance of different link prediction algorithms under varying [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: AUC under different training ratios ρ. The left and right panels correspond to the two-set split and three-set split settings, respectively. Star markers denote algorithms that achieve the best performance under a specific ρ in only one of the two split settings, while enlarged circular markers indicate algorithms that are optimal under both two-set split and three-set split for the same ρ. The grey vertic… view at source ↗
read the original abstract

Link prediction is a fundamental problem in network science, aiming to infer potential or missing links based on observed network structures. With the increasing adoption of parameterized models, the rigor of evaluation protocols has become critically important. However, a previously common practice of using the test set during hyperparameter tuning has led to human-induced information leakage, thereby inflating the reported model performance. To address this issue, this study introduces a novel evaluation metric, Loss Ratio, which quantitatively measures the extent of performance overestimation. We conduct large-scale experiments on 60 real-world networks across six domains. The results demonstrate that the information leakage leads to an average overestimation of about 3.6%, with the bias reaching over 15% for specific algorithms. Meanwhile, heuristic and random-walk-based methods exhibit greater robustness and stability. The analysis uncovers a pervasive information leakage issue in link prediction evaluation and underscores the necessity of adopting standardized data splitting strategies to enable fair and reproducible benchmarking of link prediction models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that using the test set during hyperparameter tuning for parameterized link prediction models introduces information leakage that inflates reported performance. It introduces a new metric called Loss Ratio to quantify the resulting overestimation and reports results from experiments on 60 real-world networks across six domains, finding an average overestimation of about 3.6% (with bias exceeding 15% for some algorithms). Heuristic and random-walk methods are described as more robust, and the work recommends standardized data splitting strategies for fair benchmarking.

Significance. If the central empirical findings hold after clarification of controls, the paper would usefully document a methodological issue that affects reproducibility in link prediction research. The scale of the study (60 networks) supplies broad evidence that the problem is pervasive rather than isolated. The proposal of Loss Ratio as a diagnostic offers a concrete, if preliminary, tool for quantifying evaluation bias in graph-based models.

major comments (3)
  1. [§3.2] §3.2 (Loss Ratio definition): the metric is presented as a normalized performance gap between tuning regimes that include versus exclude test edges, but the text does not specify whether hyperparameter search budget, number of trials, early-stopping criteria, or random seeds are held identical across regimes; without these controls the observed gap cannot be attributed solely to information leakage rather than differences in model selection effort.
  2. [Results section] Results section and associated tables (e.g., aggregate statistics over 60 networks): the reported average 3.6% overestimation and per-algorithm maxima above 15% are given without error bars, confidence intervals, or explicit exclusion criteria for networks, preventing assessment of whether the quantitative claims are statistically robust or sensitive to a few outliers.
  3. [§5.3] §5.3 (robustness claims): the statement that heuristic and random-walk methods exhibit greater stability is not accompanied by a controlled comparison (e.g., fixed search space size or parameter count) that would rule out the alternative explanation that these methods simply require less tuning and therefore show smaller apparent gaps.
minor comments (2)
  1. [Abstract] Abstract: the six domains of the 60 networks are mentioned but not enumerated; adding the domain list would improve context for readers.
  2. [Figures] Figure captions: several figures comparing splitting strategies would benefit from explicit labels indicating which curves correspond to 'test-included' versus 'test-excluded' tuning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have addressed each major comment point by point below, making revisions to improve clarity, statistical presentation, and controlled comparisons where appropriate.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Loss Ratio definition): the metric is presented as a normalized performance gap between tuning regimes that include versus exclude test edges, but the text does not specify whether hyperparameter search budget, number of trials, early-stopping criteria, or random seeds are held identical across regimes; without these controls the observed gap cannot be attributed solely to information leakage rather than differences in model selection effort.

    Authors: We agree that explicit controls are essential to attribute the gap to information leakage. In the revised manuscript, §3.2 now states that both regimes used an identical search budget of 100 trials, the same early-stopping rule based on validation performance, and fixed random seeds. This ensures the observed difference arises from test-edge inclusion rather than unequal optimization effort. We have added a brief description of the protocol and pseudocode in the appendix. revision: yes

  2. Referee: [Results section] Results section and associated tables (e.g., aggregate statistics over 60 networks): the reported average 3.6% overestimation and per-algorithm maxima above 15% are given without error bars, confidence intervals, or explicit exclusion criteria for networks, preventing assessment of whether the quantitative claims are statistically robust or sensitive to a few outliers.

    Authors: We acknowledge the value of statistical robustness indicators. The revised Results section now reports the mean overestimation with standard deviation (4.2%) and 95% confidence intervals across all 60 networks. We explicitly note that no networks were excluded and include a short sensitivity check confirming the average remains stable (3.5%) after removing the top 5% of outliers. revision: yes

  3. Referee: [§5.3] §5.3 (robustness claims): the statement that heuristic and random-walk methods exhibit greater stability is not accompanied by a controlled comparison (e.g., fixed search space size or parameter count) that would rule out the alternative explanation that these methods simply require less tuning and therefore show smaller apparent gaps.

    Authors: This point is well taken. We have added a controlled experiment in the revised §5.3 that normalizes the hyperparameter search to an equal number of configurations (50 trials) for every method. Under this equalized tuning budget, heuristic and random-walk methods continue to exhibit smaller gaps than parameterized models. A new supplementary table summarizes the controlled results. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with introduced metric exhibits no circularity

full rationale

The paper performs large-scale experiments across 60 real-world networks and introduces the Loss Ratio metric to quantify performance overestimation from test-set inclusion in hyperparameter tuning. No equations, derivations, or self-citations reduce the reported 3.6% average overestimation (or per-algorithm biases) to fitted parameters defined by the same experiments or to prior self-work by construction. Central claims rest on direct experimental comparisons between splitting strategies, rendering the analysis self-contained against external benchmarks with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the validity of the Loss Ratio as a leakage measure and the assumption that observed differences stem primarily from splitting strategy rather than other experimental variables; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Standard train-test splits for link prediction should keep the test set completely separate from hyperparameter tuning and model selection.
    This premise underpins the identification of information leakage as the source of overestimation.
invented entities (1)
  • Loss Ratio no independent evidence
    purpose: To quantitatively measure the extent of performance overestimation due to information leakage in evaluation protocols.
    Newly proposed metric whose definition and calibration are internal to the study.

pith-pipeline@v0.9.0 · 5466 in / 1333 out tokens · 50911 ms · 2026-05-18T00:31:44.757299+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. [1]

    L. Lü, T. Zhou, Link prediction in complex networks: A survey, Physica A 390 (2011) 1150–1170. https://doi.org/10.1016/j.physa.2010.11.027

  2. [2]

    P. Wang, B. Xu, Y. Wu, X. Zhou, Link prediction in social networks: the state-of-the-art, Sci. China Inf. Sci. 58 (2015) 1–38. https://doi.org/10.1007/s11432-015-5399-y

  3. [3]

    Martínez, F

    V. Martínez, F. Berzal, J.C. Cubero, A survey of link prediction in complex networks, ACM Comput. Surv. 49 (2016) 69. https://doi.org/10.1145/3012704

  4. [4]

    Kumar, S.S

    A. Kumar, S.S. Singh, K. Singh, B. Biswas, Link prediction techniques, applications, and performance: a survey, Physica A 553 (2020) 124289. https://doi.org/10.1016/j.physa.2020.124289

  5. [5]

    Divakaran, A

    A. Divakaran, A. Mohan, Temporal link prediction: a survey, New Gener. Comput. 38 (2020) 213–258. https://doi.org/10.1007/s00354-019-00070-4

  6. [6]

    Zhou, Progresses and challenges in link prediction, iScience 24 (2021) 103217

    T. Zhou, Progresses and challenges in link prediction, iScience 24 (2021) 103217. https://doi.org/10.1016/j.isci.2021.103217

  7. [7]

    Arrar, N

    D. Arrar, N. Kamel, A. Lakhfif, A comprehensive survey of link prediction methods, J. Supercomput. 80 (2024) 3902–3942. https://doi.org/10.1007/s11227-023-05823-4

  8. [8]

    Adamic, E

    L.A. Adamic, E. Adar, Friends and neighbors on the Web, Soc. Netw. 25 (2003) 211–230. https://doi.org/10.1016/S0378-8733(03)00009-1

  9. [9]

    Kovács, K

    I.A. Kovács, K. Luck, K. Spirohn, Y. Wang, C. Pollis, S. Schlabach, W. Bian, D.-K. Kim, N. Kishore, T. Hao, M.A. Calderwood, M. Vidal, A.-L. Barabási, Network-based prediction of protein–protein interactions, Nat. Commun. 10 (2019) 1240. https://doi.org/10.1038/s41467-019-09177-y

  10. [10]

    L. Lü, M. Medo, C.H. Yeung, Y.-C. Zhang, Z.-K. Zhang, T. Zhou, Recommender systems, Phys. Rep. 519 (2012) 1–49. https://doi.org/10.1016/j.physrep.2012.02.006

  11. [11]

    A new status index derived from sociometric analysis

    L. Katz, A new status index derived from sociometric analysis, Psychometrika 18 (1953) 39–43. https://doi.org/10.1007/BF02289026

  12. [12]

    E. A. Leicht, P. Holme, M. E. J. Newman, Vertex similarity in networks, Phys. Rev. E 73 (2006) 026120. https://doi.org/10.1103/PhysRevE.73.026120

  13. [13]

    The link-prediction problem for social networks

    D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, Journal of the American SocietyforInformationScienceandTechnology58(2007)1019–1031.https://doi.org/10.1002/asi.20591. 11

  14. [14]

    Fouss, A

    F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens, Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation, IEEE Trans. Knowl. Data Eng. 19 (2007) 355–369. https://doi.org/10.1109/TKDE.2007.46

  15. [15]

    T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (2009) 623–630. https://doi.org/10.1140/epjb/e2009-00335-8

  16. [16]

    W. Liu, L. Lü, Link prediction based on local random walk, Europhys. Lett. 89 (2010) 58007. https://doi.org/10.1209/0295-5075/89/58007

  17. [17]

    Cannistraci, G

    C.V. Cannistraci, G. Alanis-Lobato, T. Ravasi, From link-prediction in brain connectomes and pro- tein interactomes to the local-community-paradigm in complex networks, Sci. Rep. 3 (2013) 1613. https://doi.org/10.1038/srep01613

  18. [18]

    Clauset, C

    A. Clauset, C. Moore, M.E.J. Newman, Hierarchical structure and the prediction of missing links, Nature 453 (2008) 98–101. https://doi.org/10.1038/nature06830

  19. [19]

    Guimerà, M

    R. Guimerà, M. Sales-Pardo, Missing and spurious interactions and the reconstruction of complex networks, Proc. Natl. Acad. Sci. U.S.A. 106 (2009) 22073–22078

  20. [20]

    Soundarajan, J

    S. Soundarajan, J. Hopcroft, Using community information to improve the precision of link prediction methods, in: Proceedings of the 21st International Conference on World Wide Web, 2012, pp. 607–608. https://doi.org/10.1145/2187980.2188150

  21. [21]

    L. Lü, L. Pan, T. Zhou, Y. Zhang, H.E. Stanley, Toward link predictability of complex networks, Proc. Natl. Acad. Sci. U.S.A. 112 (2015) 2325–2330. https://doi.org/10.1073/pnas.1424644112

  22. [22]

    Kunegis, A

    J. Kunegis, A. Lommatzsch, Learning spectral graph transformations for link prediction, in: Pro- ceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 561–568. https://dl.acm.org/doi/10.1145/1553374.1553447

  23. [23]

    Bordes, N

    A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko, Translating embeddings for model- ing multi-relational data, in: Advances in Neural Information Processing Systems, 2013, pp. 2787–2795

  24. [24]

    Deepwalk: Online learning of social representations,

    B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 701–710. https://doi.org/10.1145/2623330.2623732

  25. [25]

    Yang, W.-T

    B. Yang, W.-T. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: Proceedings of the 3rd International Conference on Learning Representations, 2015

  26. [26]

    Trouillon, J

    T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 2071– 2080

  27. [27]

    Node2vec: Scalable feature learning for networks,

    A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864. https://doi.org/10.1145/2939672.2939754

  28. [28]

    T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proceed- ings of the 5th International Conference on Learning Representations, 2017

  29. [29]

    Hamilton, Z

    W.L. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, Adv. Neural Inf. Process. Syst. 30 (2017). https://dl.acm.org/doi/10.5555/3294771.3294869

  30. [30]

    Veličković, G

    P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, in: Proceedings of the International Conference on Learning Representations, 2018. https://openreview.net/forum?id=rJXMpikCZ

  31. [31]

    Zhang, Y.-C

    M. Zhang, Y.-C. Chen, Link prediction based on graph neural networks, in: Advances in Neural Infor- mation Processing Systems, 2018, pp. 5171–5181. 12

  32. [32]

    K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Proceedings of the International Conference on Learning Representations, 2019. https://openreview.net/forum?id=ryGs6iA5Km

  33. [33]

    S.J. Ahn, M. Kim, Variational graph normalized autoencoders, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, 2021, pp. 2827–2836. https://doi.org/10.1145/3459637.3482149

  34. [34]

    Chamberlain, S

    B.P. Chamberlain, S. Shirobokov, E. Rossi, F. Frasca, T. Markovich, N.Y. Hammerla, M.M. Bronstein, M. Hansmire, Graph neural networks for link prediction with subgraph sketching, in: Proceedings of the Eleventh International Conference on Learning Representations, 2022. https://openreview.net/forum?id=m1oqEOAozQU

  35. [35]

    R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th international joint conference on Artificial intelligence, 1995, pp. 1137–1145

  36. [36]

    Arlot, A

    S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Stat. Surv. 4 (2010) 40–79. https://doi.org/10.1214/09-SS054

  37. [37]

    Leak- age in data mining: Formulation, detection, and avoidance

    S. Kaufman, S. Rosset, C. Perlich, O. Stitelman, Leakage in data mining: Formula- tion, detection, and avoidance, ACM Trans. Knowl. Discov. Data 6 (2012) 1556–4681. https://doi.org/10.1145/2382577.2382579

  38. [38]

    Neural Computation , author =

    T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algo- rithms, Neural Comput. 10 (1998) 1895–1923. https://doi.org/10.1162/089976698300017197

  39. [39]

    Cawley, N.L.C

    G.C. Cawley, N.L.C. Talbot, On over-fitting in model selection and subsequent selection bias in perfor- mance evaluation, J. Mach. Learn. Res. 11 (2010) 2079–2107

  40. [40]

    Varma, R

    S. Varma, R. Simon, Bias in error estimation when using cross-validation for model selection, BMC Bioinform. 7 (2006) 91. https://doi.org/10.1186/1471-2105-7-91

  41. [41]

    Forman, M

    G. Forman, M. Scholz, Apples-to-apples in cross-validation studies: Pitfalls in clas- sifier performance measurement, ACM SIGKDD Explor. Newsl. 12 (2010) 49–57. https://doi.org/10.1145/1882471.1882479

  42. [42]

    Roberts, V

    D.R. Roberts, V. Bahn, S. Ciuti, M.S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, C. Lahoz- Monfort, B. Schröder, W. Thuiller, D.I. Warton, B.A. Wintle, F. Hartig, C. Dormann, Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure, Ecography 40 (2017) 913–929. https://doi.org/10.1111/ecog.02881

  43. [43]

    Breit, S

    A. Breit, S. Ott, A. Agibetov, M. Samwald, OpenBioLink: a benchmarking frame- work for large-scale biomedical link prediction, Bioinformatics 36 (2020) 4097–4098. https://doi.org/10.1093/bioinformatics/btaa274

  44. [44]

    Varoquaux, V

    G. Varoquaux, V. Cheplygina, Machine learning for medical imaging: methodological failures and recommendations for the future, NPJ Digit. Med. 5 (2022) 48. https://doi.org/10.1038/s41746-022- 00592-y

  45. [45]

    D.Sun, T.Zhou, J.G.Liu, R.Stanley, Y.-C.Zhang, Informationfilteringbasedontransferringsimilarity, Phys. Rev. E 80 (2009) 017101. https://doi.org/10.1103/PhysRevE.80.017101

  46. [46]

    Zhou, Discriminating abilities of threshold-free evaluation metrics in link prediction, Physica A 615 (2023) 128529

    T. Zhou, Discriminating abilities of threshold-free evaluation metrics in link prediction, Physica A 615 (2023) 128529. https://doi.org/10.1016/j.physa.2023.128529

  47. [47]

    X. Jiao, S. Wan, Q. Liu, Y. Bi, Y.-L. Lee, E. Xu, D. Hao, T. Zhou, Comparing discriminating abilities of evaluation metrics in link prediction, J. Phys. Complex. 5 (2024) 025014. https://doi.org/10.1088/2632- 072X/ad46be

  48. [48]

    Y. Bi, X. Jiao, Y.-L. Lee, T. Zhou, Inconsistency among evaluation metrics in link prediction, PNAS Nexus 3 (2024) pgae498. https://doi.org/10.1093/pnasnexus/pgae498. 13

  49. [49]

    S. Wan, Y. Bi, X. Jiao, T. Zhou, Quantifying discriminability of evaluation metrics in link prediction for real networks, Chaos Solitons Fractals 199 (2025) 116864. https://doi.org/10.1016/j.chaos.2025.116864

  50. [50]

    Internationale Kommunikation

    J. Kunegis, KONECT: the Koblenz network collection, in: Proceedings of the 22nd International Con- ference on World Wide Web, 2013, pp. 1343–1350. https://doi.org/10.1145/2487788.2488173

  51. [51]

    Rossi, N.K

    R.A. Rossi, N.K. Ahmed, The Network Data Repository with Interactive Graph Analytics and Visualization, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015. https://doi.org/10.1609/aaai.v29i1.9277. 14