GraphBench: Next-generation graph learning benchmarking

Ali Parviz; Antoine Siraudin; Arman Mielke; Ben Finkelshtein; Bryan Perozzi; Chendi Qian; Christopher Morris; Darius Weber; Erik Müller; Fabrizio Frasca

arxiv: 2512.04475 · v5 · submitted 2025-12-04 · 💻 cs.LG · cs.AI · cs.NE · stat.ML

Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris

Pith reviewed 2026-05-17 01:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE · stat.ML

keywords graph learning · benchmarking · graph neural networks · message passing · graph transformers · out-of-distribution generalization · reproducibility · evaluation protocols

The pith

GraphBench supplies a standardized benchmark suite for graph learning across diverse domains and task types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine learning on graphs has advanced in areas such as molecular property prediction and chip design, but fragmented datasets and inconsistent protocols have limited reproducibility. The paper introduces GraphBench to address this by offering a broad collection of real-world datasets that cover node-level, edge-level, graph-level, and generative tasks. It includes fixed dataset splits, metrics that test out-of-distribution generalization on selected tasks, and a shared framework for hyperparameter tuning. The work then runs recent message-passing networks and graph transformers on the suite to produce initial baselines. If the benchmark holds up, researchers gain a common reference point that can accelerate progress, especially as larger graph foundation models appear.

Core claim

We introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks. GraphBench provides standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization across selected tasks, as well as a unified hyperparameter-tuning framework. We further evaluate GraphBench with recent message-passing neural networks and graph transformer models, establishing principled baselines for future research.

What carries the argument

GraphBench, the benchmark suite that supplies consistent dataset splits, out-of-distribution metrics, and a shared hyperparameter-tuning framework across node, edge, graph, and generative tasks.
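To make the idea of a shared hyperparameter-tuning framework concrete, here is a minimal sketch of a seeded random search that gives every model the same search space and the same trial budget. It is an illustration under stated assumptions, not GraphBench's actual tuning code: the names `SHARED_SPACE`, `tune`, and `toy_validation_score`, as well as every value in the search space, are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]

# Hypothetical search space: names and values are illustrative only and are
# not taken from the paper.
SHARED_SPACE: Dict[str, List[float]] = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "hidden_dim": [64, 128, 256],
    "dropout": [0.0, 0.1, 0.2],
}


def tune(
    evaluate: Callable[[Config], float],
    space: Dict[str, List[float]] = SHARED_SPACE,
    n_trials: int = 20,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Seeded random search: same space, budget, and seed for every model."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    rng = random.Random(seed)  # local generator keeps the search reproducible
    best_config: Config = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = evaluate(config)  # higher is better (validation metric)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_score(config: Config) -> float:
    """Stand-in for training a model and returning a validation score."""
    return -abs(config["learning_rate"] - 3e-4) * 1000 - config["dropout"]


if __name__ == "__main__":
    print(tune(toy_validation_score))
```

Because the seed and budget are fixed, two runs of `tune` with the same evaluation function return the same configuration, which is the property that reduces tuning-related variance between papers.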

If this is right

Model comparisons become possible on equal footing across different graph domains and task types.
Research gains clearer signals on whether models truly generalize beyond their training distributions.
Variability from ad-hoc hyperparameter choices decreases because a common tuning procedure is supplied.
New graph models can be measured against documented baselines instead of isolated prior results.
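One way to see why fixed splits put model comparisons on equal footing is a deterministic, hash-based assignment of examples to train, validation, and test sets. The sketch below is an assumption-laden illustration, not the splitting procedure used by GraphBench: the function `assign_split`, the salt string, and the fractions are all hypothetical.

```python
import hashlib


def assign_split(
    example_id: str,
    valid_frac: float = 0.1,
    test_frac: float = 0.1,
    salt: str = "split-v1",  # hypothetical version tag for the split
) -> str:
    """Map an example ID to 'train', 'valid', or 'test' deterministically."""
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + valid_frac:
        return "valid"
    return "train"


if __name__ == "__main__":
    for graph_id in ["graph-0", "graph-1", "graph-2"]:
        print(graph_id, assign_split(graph_id))
```

The assignment depends only on the ID and the salt, so every user obtains identical splits regardless of machine, library version, or data ordering. A random split of this kind does not by itself test out-of-distribution generalization; that requires splitting along a property such as size, time, or domain.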

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of GraphBench could steer effort toward models that handle the included generative tasks more robustly.
The same standardization pattern might later be applied to other structured data types where benchmarking is currently scattered.
If the out-of-distribution metrics prove predictive, they could serve as a template for testing generalization in related structured-prediction settings.

Load-bearing premise

The selected datasets and tasks sufficiently represent the diversity and challenges of real-world graph learning problems across domains.

What would settle it

A follow-up study that collects new graph datasets from additional domains and finds that model rankings on GraphBench do not predict performance on those new datasets would show the benchmark misses key real-world variation.
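Such a check could be sketched as a rank correlation between model scores on the benchmark and on the newly collected datasets. The code below is a minimal standard-library illustration; the model names and scores are made up for the example and are not results from the paper.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four models (higher is better); illustrative only.
benchmark_scores = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.69, "model_d": 0.62}
new_domain_scores = {"model_a": 0.55, "model_b": 0.61, "model_c": 0.40, "model_d": 0.47}

if __name__ == "__main__":
    models = sorted(benchmark_scores)
    rho = spearman(
        [benchmark_scores[m] for m in models],
        [new_domain_scores[m] for m in models],
    )
    print(f"Spearman rho between rankings: {rho:.2f}")
```

A correlation near 1 would suggest that benchmark rankings transfer to the new datasets, while a value near 0 would support the concern that the suite misses real-world variation. With only a handful of models the estimate is noisy, so a real study would need many models and datasets.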

Original abstract

Machine learning on graphs has made substantial progress across domains such as molecular property prediction and chip design. Yet benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, hindering reproducibility and broader progress. With the recent popularity of graph foundation models, these weaknesses have become apparent, as existing benchmarks are insufficient for thorough evaluation. To address these challenges, we introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks. GraphBench provides standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization across selected tasks, as well as a unified hyperparameter-tuning framework. We further evaluate GraphBench with recent message-passing neural networks and graph transformer models, establishing principled baselines for future research. See www.graphbench.io for further details.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GraphBench adds a unified suite with OOD protocols and baselines, but dataset selection lacks the justification needed to support claims of broad real-world coverage.

GraphBench introduces a new benchmark collection that pulls together node, edge, graph, and generative tasks across domains, along with fixed splits for out-of-distribution tests and a shared hyperparameter tuning setup. The authors also run baselines using message-passing networks and graph transformers to give others a concrete reference point. That standardization effort is the clearest practical step forward here, since it directly targets the inconsistent protocols that have made comparisons in graph ML difficult. The release at graphbench.io could save researchers time when they want to test new models against something consistent rather than piecing together their own small sets.

The main weakness is in the dataset choices. The paper describes spanning diverse real-world areas like molecular prediction and chip design, yet it does not lay out explicit selection criteria, coverage metrics, or comparisons against domain surveys. Without that documentation, the claim that these tasks and splits represent the actual challenges in the field stays more asserted than demonstrated. The baselines then sit on top of that same uncertainty, so it is not yet clear how much new insight they provide about model differences.

This work is aimed at graph ML practitioners who run experiments and want better reproducibility. Someone building or evaluating larger models would find the protocols useful once the selection details are tightened. It should go to peer review. The idea addresses a genuine bottleneck in the area, and referees can push for the missing justification without needing to start from zero.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GraphBench, a benchmark suite for graph machine learning intended to address fragmented practices by spanning node-, edge-, graph-level, and generative tasks across diverse real-world domains. It supplies standardized evaluation protocols with consistent splits, metrics for out-of-distribution generalization, a unified hyperparameter-tuning framework, and baseline results from message-passing neural networks and graph transformers.

Significance. If the dataset selection and coverage can be shown to be systematically justified, GraphBench could meaningfully reduce fragmentation in graph learning evaluation and provide a more reliable platform for comparing models, including graph foundation models. The emphasis on OOD splits and unified tuning is a constructive contribution that existing benchmarks often lack.

major comments (2)

[Abstract / Benchmark Construction] Abstract and benchmark construction section: the claim that GraphBench is 'comprehensive' and spans 'diverse real-world domains' is not supported by explicit dataset selection criteria, domain-coverage metrics, or comparison against application surveys. This selection justification is load-bearing for the central assertion that the suite addresses fragmentation better than prior benchmarks.
[Abstract] Abstract: no details are provided on verification that the chosen datasets and protocols are free of post-hoc choices or on potential selection biases. Without this, the 'principled baselines' established with MPNNs and transformers inherit uncertainty about whether observed differences reflect meaningful model distinctions or benchmark artifacts.

minor comments (1)

[Abstract] The reference to www.graphbench.io for further details should be accompanied by a self-contained summary of key dataset statistics and protocol choices in the manuscript itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on strengthening the justification for dataset selection and addressing potential biases, as these are central to the value of GraphBench. We address each major comment below and have made revisions to improve clarity and transparency in the next version of the manuscript.

Referee: [Abstract / Benchmark Construction] Abstract and benchmark construction section: the claim that GraphBench is 'comprehensive' and spans 'diverse real-world domains' is not supported by explicit dataset selection criteria, domain-coverage metrics, or comparison against application surveys. This selection justification is load-bearing for the central assertion that the suite addresses fragmentation better than prior benchmarks.

Authors: We agree that explicit documentation of the selection process is necessary to substantiate the claims of comprehensiveness and diversity. In the revised manuscript, we have expanded the Benchmark Construction section with a new subsection that details the dataset selection criteria. These criteria prioritize coverage of distinct real-world domains (e.g., molecular biology, social networks, citation networks, and infrastructure), balance across task levels (node, edge, graph, and generative), and reference established application surveys in graph machine learning to ensure relevance. We have also added quantitative domain-coverage metrics, including a comparison table against prior benchmarks such as OGB and TUDataset, showing the number of unique domains and task types represented. These additions directly support the central assertion that GraphBench reduces fragmentation more effectively than existing suites.

revision: yes

Referee: [Abstract] Abstract: no details are provided on verification that the chosen datasets and protocols are free of post-hoc choices or on potential selection biases. Without this, the 'principled baselines' established with MPNNs and transformers inherit uncertainty about whether observed differences reflect meaningful model distinctions or benchmark artifacts.

Authors: We recognize the concern that insufficient transparency on selection biases and post-hoc choices could undermine confidence in the baselines. To address this, we have revised the abstract and added a dedicated paragraph in the Benchmark Construction section clarifying that datasets were selected based on their established use in prior literature and domain coverage before any baseline experiments were conducted. We describe verification steps, including reliance on publicly available fixed splits where possible and definition of OOD generalization metrics independently of model performance. A new limitations subsection acknowledges potential selection biases and explains how the unified hyperparameter-tuning framework and standardized protocols reduce the risk of benchmark artifacts influencing observed model differences. These changes provide greater transparency while preserving the integrity of the reported baselines.

revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark curation paper with no derivation chain

This is an engineering/benchmark paper whose central contribution is the introduction of GraphBench itself—a curated suite of datasets, tasks, splits, metrics, and tuning protocols—rather than any mathematical derivation, first-principles prediction, or fitted quantity that reduces to its own inputs. The abstract and description contain no equations, no self-definitional loops, no fitted-input predictions, and no load-bearing self-citations to uniqueness theorems. Claims of comprehensiveness rest on the explicit selection and standardization choices documented in the paper, which are externally verifiable against the released suite at graphbench.io and do not collapse into prior results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper with no free parameters, mathematical axioms, or invented entities; relies on standard practices in ML benchmarking.

pith-pipeline@v0.9.0 · 5523 in / 1081 out tokens · 27447 ms · 2026-05-17T01:25:38.523552+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks... standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

GraphBench provides standardized evaluation protocols... unified hyperparameter-tuning framework

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Have Graph -- Will Lift? The Case for Higher-Order Benchmarks
cs.LG 2026-05 unverdicted novelty 3.0

The paper argues that the topological deep learning community should develop new benchmark datasets with native higher-order structure rather than continuing to lift graph datasets.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper

[1]

SHA-256 collision attack with programmatic SAT

N Alamgir, S Nejati, and C Bright. SHA-256 collision attack with programmatic SAT. In Workshop on Practical Aspects of Automated Reasoning (PAAR) and Satisfiability Checking and Symbolic Computation Workshop (SC-Square), adjunct to (IJCAR 2024),

work page 2024
[2]

T Balyo, M J. H. Heule, and M Järvisalo. SAT competition 2016: Recent developments. In Proceedings of the AAAI Conference on Artificial Intelligence,

work page 2016
[5]

BreakID-kissat in SAT Competition 2024 (System Description)

B Bogaerts, J Nordström, A Oertel, and D Vandesande. BreakID-kissat in SAT Competition 2024 (System Description). In Proceedings of SAT Competition 2024,

work page 2024
[6]

Breakid-kissat in sat competition 2023 (system description)

Bart Bogaerts, Jakob Nordström, Andy Oertel, and Cagrı Uluç Yıldırımoglu. Breakid-kissat in sat competition 2023 (system description). In Proceedings of SAT Competition 2023,

work page 2023
[7]

In Proceedings of SAT Competition 2023,

work page 2023
[8]

ASAP.V2 and ASAP.V3: sequential optimization of an algorithm selector and a scheduler

F Gonard, M Schoenauer, and M Sebag. ASAP.V2 and ASAP.V3: sequential optimization of an algorithm selector and a scheduler. In Open Algorithm Selection Challenge 2017, Proceedings of Machine Learning Research,

work page 2017
[9]

Sbva-cadical and sbva-kissat: Structured bounded variable addition

A Haberlandt and H Green. Sbva-cadical and sbva-kissat: Structured bounded variable addition. In Proceedings of SAT Competition 2023,

work page 2023
[10]

SAT competition 2018. Journal on Satisfiability, Boolean Modeling and Computation, (1),

M J H Heule, M Järvisalo, and M Suda. SAT competition 2018. Journal on Satisfiability, Boolean Modeling and Computation, (1),

work page 2018
[11]

ISBN 9783959771566

Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 9783959771566. doi: 10.4230/LIPIcs.CCC.2020.22. URL https://doi.org/10.4230/LIPIcs.CCC.2020.22. A Iser and C Jabs. Global benchmark database. In Proceedings of the International Conference on Theory and Applications of Satisfiability Testing SAT, LIPIcs,

work page doi:10.4230/lipics.ccc.2020.22 2020
[12]

URL https://doi.org/10.1145/368996.369025

doi: 10.1145/368996.369025. URL https://doi.org/10.1145/368996.369025. Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In Proceedings of the International Conference on Learning Representations ICLR,

work page doi:10.1145/368996.369025
[13]

W Li, R Li, Y Ma, S O Chan, D Pan, and B Yu

In Proceedings of SAT Competition 2024, 2024b. W Li, R Li, Y Ma, S O Chan, D Pan, and B Yu. Rethinking graph neural networks for the graph coloring problem. arXiv preprint,

work page 2024
[14]

The algorithm selection competitions 2015 and

M Lindauer, J N van Rijn, and L Kotthoff. The algorithm selection competitions 2015 and

work page 2015
[15]

In Proceedings of SAT Competition 2024,

work page 2024
[16]

Problems and results of iwls 2023 programming contest,

A Mishchenko and Y Miyasaka. Problems and results of iwls 2023 programming contest,

work page 2023
[17]

Accessed: 2025-09-23

URL https://pkgs.fedoraproject.org/repo/extras/ngspice/ngspice23-manual.pdf/eb0d68eb463a41a0571757a00a5b9f9d/ngspice23-manual.pdf. Accessed: 2025-09-23. M E J Newman. Networks: An Introduction

work page 2025
[18]

doi: 10.1057/s41599-024-03253-5

ISSN 2662-9992. doi: 10.1057/s41599-024-03253-5. URL https://doi.org/10.1057/s41599-024-03253-5. Vangelis Th Paschos. Applications of combinatorial optimization

work page doi:10.1057/s41599-024-03253-5
[19]

In Proceedings of SAT Competition 2023,

work page 2023
[20]

In Proceedings of SAT Competition 2023,

work page 2023
[21]

On the complexity of derivation in propositional calculus

G S Tseitin. On the complexity of derivation in propositional calculus. In Automation of reasoning: 2: Classical papers on computational logic 1967–1970

work page 1967
[22]

Dataset details Here, we provide additional details on the datasets

A. Dataset details. Here, we provide additional details on the datasets. A.1. Social networks: Predicting engagements on BlueSky. More on metrics. We assess model performance using two metrics: the coefficient of determination (𝑅²) and the Spearman correlation (𝜌). Given a set of evaluation nodes 𝑈 ⊂ 𝑉, reference engagement kind 𝜅 and prediction interval 𝜏 1,2, th...

work page 2023
[23]

Experimental setup Similar to Bechler-Speicher et al

B. Experimental setup. Similar to Bechler-Speicher et al. (2025), we provide an encoder-processor-decoder architecture for all tasks, which we detail in the following. While task-specific parts of our architecture vary, we offer a general architecture to measure the performance of selected baselines on our tasks. For the chip design and weather forecast datase...

work page 2025
[24]

128 128 128 128 128 128 128 128 128 B.2

0.2 0.2 Epochs 1000 1000 1000 1000 1000 1000 1000 1000 1000 Architecture Layers 4 2 4 4 4 4 4 2 2 Hidden dim. 128 128 128 128 128 128 128 128 128 B.2. Baseline architectures. In the following, we provide implementation details on the baselines used for the datasets in GraphBench. For dataset-specific design choices, we provide detailed information in Section ...

work page 2025
[25]

384 384 384 384 Attn

Epochs 700 700 700 700 Architecture Layers 6 4 4 4 Hidden dim. 384 384 384 384 Attn. heads 0 4 0 0 Activation GELU RELU RELU RELU 58 tokenization, treating each graph node as a single token input to the GT. However, for edge-level tasks, we use the transformation outlined for algorithmic reasoning tasks, allowing edge-level tokens to be used without chang...

work page 2025
