GraphBench: Next-generation graph learning benchmarking
Pith reviewed 2026-05-17 01:25 UTC · model grok-4.3
The pith
GraphBench supplies a standardized benchmark suite for graph learning across diverse domains and task types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks. GraphBench provides standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization across selected tasks, as well as a unified hyperparameter-tuning framework. We further evaluate GraphBench with recent message-passing neural networks and graph transformer models, establishing principled baselines for future research.
What carries the argument
GraphBench, the benchmark suite that supplies consistent dataset splits, out-of-distribution metrics, and a shared hyperparameter-tuning framework across node, edge, graph, and generative tasks.
If this is right
- Model comparisons become possible on equal footing across different graph domains and task types.
- Research gains clearer signals on whether models truly generalize beyond their training distributions.
- Variability from ad-hoc hyperparameter choices decreases because a common tuning procedure is supplied.
- New graph models can be measured against documented baselines instead of isolated prior results.
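The point about a common tuning procedure can be made concrete with a minimal sketch. It assumes a plain random search in which every model receives the same search space, the same trial budget, and the same seed, and in which the configuration is chosen on validation score only. The search space, the function names, and the two toy scoring functions are hypothetical stand-ins, not GraphBench's actual framework.

```python
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_dim": [64, 128, 256],
    "dropout": [0.0, 0.1, 0.2],
}


def tune(train_and_eval, search_space, n_trials, seed):
    """Random search that selects the config with the best validation score."""
    rng = random.Random(seed)  # same seed -> same sampled configs for every model
    best = None
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in search_space.items()}
        val_score, test_score = train_and_eval(config)
        # Selection uses the validation score only; the test score is just recorded.
        if best is None or val_score > best["val"]:
            best = {"config": config, "val": val_score, "test": test_score}
    return best


def toy_mpnn(config):
    # Stand-in for training a model; returns (validation score, test score).
    val = 0.70 + 0.05 * (config["hidden_dim"] == 128) - 0.1 * config["dropout"]
    return val, val - 0.02


def toy_transformer(config):
    # Second stand-in with a different sensitivity to the hyperparameters.
    val = 0.68 + 0.08 * (config["learning_rate"] == 3e-4) - 0.05 * config["dropout"]
    return val, val - 0.03


# Every model gets the identical budget and the identical sampled configurations.
results = {
    name: tune(fn, SEARCH_SPACE, n_trials=20, seed=0)
    for name, fn in [("mpnn", toy_mpnn), ("transformer", toy_transformer)]
}
```

Because the budget and the sampled configurations are shared, a remaining gap between the entries of `results` is less likely to come from one model simply having been tuned harder than the other.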
Where Pith is reading between the lines
- Adoption of GraphBench could steer effort toward models that handle the included generative tasks more robustly.
- The same standardization pattern might later be applied to other structured data types where benchmarking is currently scattered.
- If the out-of-distribution metrics prove predictive, they could serve as a template for testing generalization in related structured-prediction settings.
Load-bearing premise
The selected datasets and tasks sufficiently represent the diversity and challenges of real-world graph learning problems across domains.
What would settle it
A follow-up study that collects new graph datasets from additional domains and finds that model rankings on GraphBench do not predict performance on those new datasets would show the benchmark misses key real-world variation.
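One way to make this test quantitative is a rank correlation between model scores on the benchmark and on the newly collected datasets. The sketch below computes Spearman's rank correlation with the standard library only. The model names and all scores are invented for illustration and are not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores (higher is better); not taken from the paper.
benchmark_scores = {"model_a": 0.71, "model_b": 0.78, "model_c": 0.64, "model_d": 0.80}
new_domain_scores = {"model_a": 0.55, "model_b": 0.52, "model_c": 0.60, "model_d": 0.49}

models = sorted(benchmark_scores)
rho = spearman(
    [benchmark_scores[m] for m in models],
    [new_domain_scores[m] for m in models],
)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = -1.00"
```

In this invented case the ranking on the new domain is exactly reversed, which is the kind of outcome the paragraph above describes as evidence against the benchmark. A value near 1 would instead suggest that benchmark rankings transfer.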
Original abstract
Machine learning on graphs has made substantial progress across domains such as molecular property prediction and chip design. Yet benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, hindering reproducibility and broader progress. With the recent popularity of graph foundation models, these weaknesses have become apparent, as existing benchmarks are insufficient for thorough evaluation. To address these challenges, we introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks. GraphBench provides standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization across selected tasks, as well as a unified hyperparameter-tuning framework. We further evaluate GraphBench with recent message-passing neural networks and graph transformer models, establishing principled baselines for future research. See www.graphbench.io for further details.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GraphBench, a benchmark suite for graph machine learning intended to address fragmented practices by spanning node-, edge-, graph-level, and generative tasks across diverse real-world domains. It supplies standardized evaluation protocols with consistent splits, metrics for out-of-distribution generalization, a unified hyperparameter-tuning framework, and baseline results from message-passing neural networks and graph transformers.
Significance. If the dataset selection and coverage can be shown to be systematically justified, GraphBench could meaningfully reduce fragmentation in graph learning evaluation and provide a more reliable platform for comparing models, including graph foundation models. The emphasis on OOD splits and unified tuning is a constructive contribution that existing benchmarks often lack.
major comments (2)
- [Abstract / Benchmark Construction] Abstract and benchmark construction section: the claim that GraphBench is 'comprehensive' and spans 'diverse real-world domains' is not supported by explicit dataset selection criteria, domain-coverage metrics, or comparison against application surveys. This selection justification is load-bearing for the central assertion that the suite addresses fragmentation better than prior benchmarks.
- [Abstract] Abstract: no details are provided on verification that the chosen datasets and protocols are free of post-hoc choices or on potential selection biases. Without this, the 'principled baselines' established with MPNNs and transformers inherit uncertainty about whether observed differences reflect meaningful model distinctions or benchmark artifacts.
minor comments (1)
- [Abstract] The reference to www.graphbench.io for further details should be accompanied by a self-contained summary of key dataset statistics and protocol choices in the manuscript itself.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on strengthening the justification for dataset selection and addressing potential biases, as these are central to the value of GraphBench. We address each major comment below and have made revisions to improve clarity and transparency in the next version of the manuscript.
Point-by-point responses
- Referee: [Abstract / Benchmark Construction] Abstract and benchmark construction section: the claim that GraphBench is 'comprehensive' and spans 'diverse real-world domains' is not supported by explicit dataset selection criteria, domain-coverage metrics, or comparison against application surveys. This selection justification is load-bearing for the central assertion that the suite addresses fragmentation better than prior benchmarks.
  Authors: We agree that explicit documentation of the selection process is necessary to substantiate the claims of comprehensiveness and diversity. In the revised manuscript, we have expanded the Benchmark Construction section with a new subsection that details the dataset selection criteria. These criteria prioritize coverage of distinct real-world domains (e.g., molecular biology, social networks, citation networks, and infrastructure), balance across task levels (node, edge, graph, and generative), and reference established application surveys in graph machine learning to ensure relevance. We have also added quantitative domain-coverage metrics, including a comparison table against prior benchmarks such as OGB and TUDataset, showing the number of unique domains and task types represented. These additions directly support the central assertion that GraphBench reduces fragmentation more effectively than existing suites.
  Revision: yes
- Referee: [Abstract] Abstract: no details are provided on verification that the chosen datasets and protocols are free of post-hoc choices or on potential selection biases. Without this, the 'principled baselines' established with MPNNs and transformers inherit uncertainty about whether observed differences reflect meaningful model distinctions or benchmark artifacts.
  Authors: We recognize the concern that insufficient transparency on selection biases and post-hoc choices could undermine confidence in the baselines. To address this, we have revised the abstract and added a dedicated paragraph in the Benchmark Construction section clarifying that datasets were selected based on their established use in prior literature and domain coverage before any baseline experiments were conducted. We describe verification steps, including reliance on publicly available fixed splits where possible and definition of OOD generalization metrics independently of model performance. A new limitations subsection acknowledges potential selection biases and explains how the unified hyperparameter-tuning framework and standardized protocols reduce the risk of benchmark artifacts influencing observed model differences. These changes provide greater transparency while preserving the integrity of the reported baselines.
  Revision: yes
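One way to keep an OOD split independent of model performance is to derive it from a structural property that is fixed before any training run. The sketch below assumes graph size (node count) as the shift variable. That is an illustrative choice, not necessarily the one GraphBench makes, and the function names and sample sizes are hypothetical.

```python
def size_based_split(graph_sizes, ood_fraction=0.2):
    """Hold out the largest graphs as the OOD set; return (in_dist, ood) indices."""
    # Sort by size, breaking ties by index so the split is deterministic.
    order = sorted(range(len(graph_sizes)), key=lambda i: (graph_sizes[i], i))
    n_ood = max(1, int(round(len(order) * ood_fraction)))
    return order[:-n_ood], order[-n_ood:]


def generalization_gap(in_dist_score, ood_score):
    """Positive values mean the model does worse on the shifted distribution."""
    return in_dist_score - ood_score


# Hypothetical node counts for ten graphs.
graph_sizes = [12, 45, 8, 30, 90, 15, 60, 22, 75, 10]
in_dist, ood = size_based_split(graph_sizes, ood_fraction=0.2)
```

The split depends only on `graph_sizes`, so it can be published alongside the data and cannot be adjusted after seeing which models do well.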
Circularity Check
No circularity: benchmark curation paper with no derivation chain
Full rationale
This is an engineering/benchmark paper whose central contribution is the introduction of GraphBench itself—a curated suite of datasets, tasks, splits, metrics, and tuning protocols—rather than any mathematical derivation, first-principles prediction, or fitted quantity that reduces to its own inputs. The abstract and description contain no equations, no self-definitional loops, no fitted-input predictions, and no load-bearing self-citations to uniqueness theorems. Claims of comprehensiveness rest on the explicit selection and standardization choices documented in the paper, which are externally verifiable against the released suite at graphbench.io and do not collapse into prior results by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks... standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: GraphBench provides standardized evaluation protocols... unified hyperparameter-tuning framework
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Have Graph -- Will Lift? The Case for Higher-Order Benchmarks
  The paper argues that the topological deep learning community should develop new benchmark datasets with native higher-order structure rather than continuing to lift graph datasets.