A Fair Evaluation of Graph Foundation Models for Node Property Prediction
Pith reviewed 2026-06-26 00:48 UTC · model grok-4.3
The pith
Only the most recent Graph Foundation Models based on Prior-data Fitted Networks outperform well-tuned Graph Neural Networks on node property prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a fair and rigorous reevaluation of 9 recent GFMs for node property prediction, only the most recent ones based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs in predictive performance, although at a higher inference cost.
What carries the argument
A unified evaluation setting that standardizes datasets, protocols, and hyperparameter tuning to enable direct comparison of GFMs against GNN baselines.
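The paper text extracted further down this page mentions training for a maximum of 3000 steps with early stopping on validation performance and a patience of 1000 steps. A minimal Python sketch of that stopping rule follows. The names train_step and validate are hypothetical placeholders, and evaluating the validation metric after every step is an assumption, not something the source states.

```python
def train_with_early_stopping(train_step, validate, max_steps=3000, patience=1000):
    """Run up to max_steps; stop once validation has not improved for `patience` steps.

    train_step(step) and validate(step) are hypothetical callables supplied by
    the caller. validate must return a score where higher is better.
    Returns (best_step, best_score, steps_run).
    """
    best_score = float("-inf")
    best_step = 0
    steps_run = 0
    for step in range(1, max_steps + 1):
        train_step(step)        # assumed: one optimizer update
        score = validate(step)  # assumed: validation metric after every step
        steps_run = step
        if score > best_score:
            best_score, best_step = score, step
        elif step - best_step >= patience:
            # No improvement for `patience` consecutive steps: stop early.
            break
    return best_step, best_score, steps_run
```

In practice one would also save the model weights at best_step and restore them before testing; that part is omitted here.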
If this is right
- Most GFMs proposed to date do not improve predictive performance over well-tuned GNNs when placed under identical evaluation conditions.
- PFN-based GFMs deliver higher accuracy but incur greater inference cost than the GNN baselines.
- Inconsistent evaluation practices have previously obscured which models represent genuine advances.
- Applications in fraud detection and recommendation systems may adopt the top PFN models only where the added inference cost is acceptable.
Where Pith is reading between the lines
- Model developers could explore hybrids that retain GNN efficiency while incorporating prior-data fitting to close the cost gap.
- Claims of foundation-model superiority on graphs should be tested against standardized GNN baselines before being treated as settled.
- Extending the protocol to larger or more heterogeneous graphs would test whether the observed performance ordering holds beyond the current datasets.
Load-bearing premise
The chosen evaluation protocols, datasets, and hyperparameter tuning procedures constitute a fair and rigorous unified setting that enables reliable comparison across GFMs and GNN baselines.
What would settle it
If re-running the same experiments with additional datasets or different tuning procedures showed well-tuned GNNs matching or exceeding the top PFN-based GFMs, the performance claim would be falsified.
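The extracted paper text on this page refers to two aggregated metrics, average rank and average normalized score. The sketch below shows one way such aggregates could be computed to check whether a GNN matches or exceeds a GFM across datasets. The per-dataset min-max normalization, the tie handling, and all model names and scores are illustrative assumptions, not the paper's definitions or results.

```python
from statistics import mean


def average_ranks(scores):
    """Mean rank per model; scores[model][dataset], higher is better."""
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    ranks = {m: [] for m in models}
    for d in datasets:
        for m in models:
            better = sum(scores[o][d] > scores[m][d] for o in models)
            tied = sum(scores[o][d] == scores[m][d] for o in models) - 1
            # Tied models share the average of the ranks they occupy.
            ranks[m].append(1 + better + tied / 2)
    return {m: mean(r) for m, r in ranks.items()}


def average_normalized_scores(scores):
    """Mean of per-dataset min-max normalized scores (assumed definition)."""
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    normalized = {m: [] for m in models}
    for d in datasets:
        lo = min(scores[m][d] for m in models)
        hi = max(scores[m][d] for m in models)
        for m in models:
            # If all models tie on a dataset, give every model 1.0.
            value = 1.0 if hi == lo else (scores[m][d] - lo) / (hi - lo)
            normalized[m].append(value)
    return {m: mean(v) for m, v in normalized.items()}


# Made-up scores for illustration only; not results from the paper.
example_scores = {
    "tuned_gnn": {"dataset_a": 0.80, "dataset_b": 0.70},
    "pfn_gfm": {"dataset_a": 0.85, "dataset_b": 0.75},
    "older_gfm": {"dataset_a": 0.75, "dataset_b": 0.70},
}

if __name__ == "__main__":
    print(average_ranks(example_scores))
    print(average_normalized_scores(example_scores))
```

A lower average rank and a higher average normalized score both indicate a stronger model. The claim would be in doubt if a tuned GNN came out ahead of the PFN-based models on both aggregates.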
Original abstract
Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called GFMs, particular interest has been paid to GFMs designed for node property prediction tasks, which is one of the most popular settings in Graph ML with lots of real-world applications from fraud detection in financial and social networks to recommendation systems for e-commerce and user-generated content platforms. While a number of GFMs for this task have been recently proposed, the field has not converged to a unified evaluation setting, and different works evaluate their models in widely different ways, preventing reliable comparison of GFMs with each other and with other types of models. In this work, we conduct a fair and rigorous reevaluation of 9 recent GFMs for node property prediction, comparing them to strong Graph Neural Network (GNN) baselines. We find that, among these GFMs, only the most recent ones based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs in predictive performance, although at a higher inference cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a fair and rigorous reevaluation of 9 recent Graph Foundation Models (GFMs) for node property prediction tasks. It compares them to strong Graph Neural Network (GNN) baselines under a unified setting and reports that only the most recent GFMs based on the Prior-data Fitted Networks (PFN) paradigm outperform well-tuned GNNs, although at higher inference cost.
Significance. If the claimed unified evaluation protocol is sound and reproducible, the work would provide a much-needed standardized benchmark for GFMs versus GNNs. This addresses the current lack of convergence on evaluation settings in the field and could help clarify whether recent GFMs deliver practical advantages for node property prediction applications such as fraud detection and recommendation systems.
Simulated Author's Rebuttal
We thank the referee for their summary recognizing the value of a unified evaluation for GFMs versus GNNs. The recommendation is listed as uncertain, but the report contains no specific major comments to address. We stand ready to provide further details on the protocol or results if that would resolve the uncertainty.
Circularity Check
No significant circularity in empirical benchmarking study
This is a purely empirical reevaluation paper that compares existing GFMs to GNN baselines on node property prediction tasks using datasets and protocols. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains appear in the abstract or described content. The central claim rests on experimental results rather than any reduction of predictions to inputs by construction, satisfying the default expectation of no circularity for non-derivational work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-
[2]
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
-
[3]
Choi, J., Kang, W., Kim, M., Kim, J., and Park, N. Can TabPFN compete with GNNs for node classification via graph tabularization? arXiv preprint arXiv:2512.08798.
-
[4]
De Nadai, M., Fabbri, F., Gigioli, P., Wang, A., Li, A., Silvestri, F., Kim, L., Lin, S., Radosavljevic, V., Ghael, S., Nyhan, D., Bouchard, H., Lalmas, M., and Damianou, A. Personalized audiobook recommendations at Spotify through graph neural networks. In Companion Proceedings of the ACM Web Conference 2024, pp. 403–412, 2024.
-
[5]
Dong, K., Mao, H., Guo, Z., and Chawla, N. V. Universal link predictor by in-context learning on graphs. arXiv preprint arXiv:2402.07738.
-
[6]
Eremeev, D., Bazhenov, G., Platonov, O., Babenko, A., and Prokhorenkova, L. Turning tabular foundation models into graph foundation models. arXiv preprint arXiv:2508.20906, 2025.
-
[7]
Fey, M., Hu, W., Huang, K., Lenssen, J. E., Ranjan, R., Robinson, J., Ying, R., You, J., and Leskovec, J. Relational deep learning: Graph representation learning on relational databases. arXiv preprint arXiv:2312.04615.
-
[8]
Frasca, F., Jogl, F., Eliasof, M., Ostrovsky, M., Schönlieb, C.-B., Gärtner, T., and Maron, H. Towards foundation models on graphs: An analysis on cross-dataset transfer of pretrained GNNs. arXiv preprint arXiv:2412.17609.
-
[9]
Grinsztajn, L., Flöge, K., Key, O., Birkel, F., Jund, P., Roof, B., Manium, M., Hoo, S. B., Bühler, M., Garg, A., Safaric, D., Robertson, J., Jäger, B., Alessi, S., Hayler, A., Moroshan, V., Purucker, L., Singer, P., Arazi, A., Siems, J., Metzen, J. H., Grab, G., Erickson, N., Guo, S., Kalfon, E., Bing, S., Salinas, D., Cornu, C., Wehrhahn, L. C., ...
-
[10]
Hayler, A., Huang, X., Ceylan, I. I., Bronstein, M., and Finkelshtein, B. Bringing graphs to the table: Zero-shot node classification via tabular foundation models. arXiv preprint arXiv:2509.07143.
-
[11]
Lipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341.
-
[12]
Platonov, O. and Prokhorenkova, L. Cluster attention for graph machine learning. arXiv preprint arXiv:2604.07492, 2026.
-
[13]
Rubachev, I., Kotelnikov, A., Kartashev, N., and Babenko, A. On finetuning tabular foundation models. arXiv preprint arXiv:2506.08982.
-
[14]
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
-
[15]
Stoll, T., Qian, C., Finkelshtein, B., Parviz, A., Weber, D., Frasca, F., Shavit, H., Siraudin, A., Mielke, A., Anastacio, M., Müller, E., Bechler-Speicher, M., Bronstein, M., Galkin, M., Hoos, H., Niepert, M., Perozzi, B., Tönshoff, J., and Morris, C. GraphBench: Next-generation graph learning benchmarking. arXiv preprint arXiv:2512.04475, 2025.
-
[16]
Watanabe, S. Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv preprint arXiv:2304.11127.
-
[17]
Xia, L. and Huang, C. AnyGraph: Graph foundation model in the wild. arXiv preprint arXiv:2408.10700.
-
[18]
Xia, L., Kao, B., and Huang, C. OpenGraph: Towards open graph foundation models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
-
[19]
Yu, X., Gong, Z., Zhou, C., Fang, Y., and Zhang, H. SAMGPT: Text-free graph foundation model for multi-domain pre-training and cross-domain adaptation. In Proceedings of the ACM Web Conference 2025, pp. 1142–1153, 2025.
-
[20]
Zhang, X., Ren, G., Yu, H., Yuan, H., Wang, H., Li, J., Wu, J., Mo, L., Mao, L., Hao, M., Dai, N., Xu, R., Li, S., Zhang, T., He, Y., Wang, Y., Zhang, Y., Xu, Z., Li, D., Gao, F., Zou, H., Liu, J., Liu, J., Xu, J., Cheng, K., Li, K., Zhou, L., Li, Q., Fan, S., Lin, X., Han, X., Li, X., Lu, Y., Xue, Y., Jiang, Y., Wang, Z., Wang, Z., and Cui, P. Limi...
-
[21]
A. Non-PFN-Based GFMs for Node Property Prediction. The key challenge for GFMs dealing with node property prediction is that graphs can come from vastly different domains (e.g., social networks, co-purchasing networks, road networks). This implies that the graphs can have different...
2024
-
[22]
learning how to learn
under the name of Transformer Neural Processes). The idea of PFNs is to train models to make predictions on previously unseen datasets in a single forward pass. PFNs perform in-context learning (ICL): rather than updating the model parameters for each new dataset, they use the context provided as input to make predictions without per-dataset training. In ...
2017
-
[23]
Thus, these models handle graph-structured data natively instead of having to simplify it by converting to a table
— are significantly more involved: they create custom graph-native model architectures by augmenting standard PFN Transformers with graph neighborhood aggregation modules, design prior distributions over attributed graphs, and train these models on synthetic node property prediction datasets sampled from these prior distributions. Thus, these models handl...
2025
-
[24]
and Graph Foundation Models (Eremeev et al., 2025
2025
-
[25]
and normalizations (Ioffe & Szegedy, 2015; Ba et al., 2016), which often significantly improve the performance of neural models and are frequently used by the models being compared to GNNs. Further, from our extensive experience of training GNNs across a wide range of datasets, tasks, and settings, both for research and for industrial applications, we obs...
2015
-
[26]
with simple additive attention and Local Graph Transformer (LGT) (Shi et al., 2021; Platonov et al., 2023b) with scaled dot product attention (Vaswani et al.,
2021
-
[27]
The official implementation of GNNs from Platonov et al
(note that this Graph Transformer variant only allows each node to attend to its neighbors — we specifically refer to it as a Local Graph Transformer following Platonov & Prokhorenkova (2026) to distinguish it from more common Global Graph Transformers that allow each node to attend to any other node unconstrained by the graph structure). The official imp...
2026
-
[28]
in the MPNNs class from the official codebase of Luo et al. (2024). In our experiments in Section 3, we observe that the GNN implementations from Luo et al. (2024) typically perform better for the simpler GCN and GraphSAGE models, while the GNN implementations from Platonov et al. (2023b) typically perform better for the more complex attention-based GAT a...
2024
-
[29]
Further, we also observe that the GNN implementations from Platonov et al
is the strongest of the GNNs in our experiments, achieving the fifth place among all the considered models according to both average rank and average normalized score aggregated metrics (and the third and the second places on the subset of only node classification datasets according to average rank and average normalized score aggregated metrics, respecti...
2024
-
[30]
and GraphBench (Stoll et al., 2025). While GraphBench mostly aims at widening the scope of Graph ML tasks and does not present many classic node property prediction tasks, GraphLand aims exactly at node property prediction and provides a diverse collection of datasets representing real-world industrial applications of this task. We believe GraphLand can b...
2025
-
[31]
provided in the Optuna library (Akiba et al., 2019). For each of the considered GNNs, we search over all hyperparameters available in the corresponding official implementation. The complete hyperparameter search spaces are provided in Appendix I. We train all GNNs with the AdamW optimizer (Kingma & Ba, 2015; Loshchilov & Hutter,
2019
-
[32]
for a maximum of 3000 steps using early stopping based on the validation set performance with a patience of 1000 steps. GFMs can support the in-context learning (ICL) regime, the fine-tuning (FT) regime, or both. One of the benefits of the ICL regime is that it does not require hyperparameter optimization, as the pretrained model is used as is without weight ...
2019
-
[33]
Note that for GNNs, the models with the best hyperparameters can be widely different in size across GNN types and datasets
We can see that inference-time ensembling almost always improves the predictive performance of the [...] Table 3. The memory (VRAM, in GB) required for: a single training run with the best hyperparameters (Tr), a single inference run with the best hyperparameters (Inf). Note that for GN...
2024