TUDataset: A collection of benchmark datasets for learning with graphs

Christopher Morris; Franka Bause; Kristian Kersting; Marion Neumann; Nils M. Kriege; Petra Mutzel

arxiv: 2007.08663 · v1 · pith:BO7S35T3new · submitted 2020-07-16 · 💻 cs.LG · cs.NE · stat.ML

Pith reviewed 2026-05-25 07:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · stat.ML

keywords graph classification · graph regression · benchmark datasets · graph neural networks · TUDataset · machine learning on graphs · kernel methods

The pith

The TUDataset supplies over 120 benchmark datasets for graph classification and regression together with Python data loaders, kernel and graph neural network baselines, and evaluation tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the TUDataset collection to address the shortage of standardized benchmarks for supervised learning on graph data. It gathers more than 120 datasets spanning many applications and sizes. Accompanying resources include Python loaders for the data, reference implementations of kernel methods and graph neural networks, plus tools to run and compare experiments. The goal is to make it easier for researchers to perform consistent and reproducible work in graph classification and regression. All materials are released online.

Core claim

We introduce the TUDataset for graph classification and regression. The collection consists of over 120 datasets of varying sizes from a wide range of applications. We provide Python-based data loaders, kernel and graph neural network baseline implementations, and evaluation tools. Here, we give an overview of the datasets, standardized evaluation procedures, and provide baseline experiments.

What carries the argument

The TUDataset collection, which aggregates more than 120 benchmark datasets for graph tasks and supplies loaders plus baseline code for kernels and graph neural networks.
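As a rough illustration of what such a loader does, the sketch below groups a flat edge list into per-graph records. The file layout it assumes (an edge file of comma-separated node pairs, a node-to-graph indicator file, and a graph label file, all 1-based) is not described on this page, and the function name `parse_tu_format` is hypothetical; this is not the loader the authors ship.

```python
from collections import defaultdict


def parse_tu_format(edge_lines, indicator_lines, label_lines):
    """Group a flat edge list into one record per graph.

    Assumed layout (an assumption, not taken from the text above):
      edge_lines      -- "u, v" pairs using 1-based global node ids
      indicator_lines -- line i holds the 1-based graph id of node i
      label_lines     -- line g holds the integer label of graph g
    """
    # Map every global node id to the graph it belongs to.
    node_to_graph = {
        node: int(line) for node, line in enumerate(indicator_lines, start=1)
    }
    labels = [int(line) for line in label_lines]

    graphs = defaultdict(lambda: {"nodes": [], "edges": []})
    for node, gid in node_to_graph.items():
        graphs[gid]["nodes"].append(node)

    for line in edge_lines:
        u, v = (int(tok) for tok in line.split(","))
        gid = node_to_graph[u]
        if node_to_graph[v] != gid:
            raise ValueError(f"edge ({u}, {v}) connects two different graphs")
        graphs[gid]["edges"].append((u, v))

    # Emit graphs in id order, attaching the label of each one.
    return [
        {
            "nodes": graphs[g]["nodes"],
            "edges": graphs[g]["edges"],
            "label": labels[g - 1],
        }
        for g in range(1, len(labels) + 1)
    ]


# Two toy graphs: an edge (nodes 1-2) and a path (nodes 3-4-5).
toy = parse_tu_format(
    edge_lines=["1, 2", "2, 1", "3, 4", "4, 3", "4, 5", "5, 4"],
    indicator_lines=["1", "1", "2", "2", "2"],
    label_lines=["0", "1"],
)
```

Keeping node ids global, rather than renumbering per graph, makes it straightforward to attach node-level attributes stored in the same row order.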

If this is right

Standardized evaluation procedures enable direct comparisons of methods on the same graph classification tasks.
Baseline kernel and graph neural network results serve as reference points for new approaches.
Access to datasets from many application areas supports testing across domains.
Reproducible code allows verification of reported performance numbers.
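One way to picture a standardized evaluation procedure is a fixed, seeded, stratified k-fold split that every method is scored on. The sketch below is an assumption for illustration only: this page does not state which protocol the paper uses, and the names `stratified_folds`, `cross_validate`, and `fit_predict` are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class balance."""
    if len(labels) < k:
        raise ValueError("need at least k samples")
    rng = random.Random(seed)  # fixed seed so every method sees the same split
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal indices round-robin, continuing where the last class stopped.
        for j, idx in enumerate(idxs):
            folds[(offset + j) % k].append(idx)
        offset += len(idxs)
    return folds


def cross_validate(labels, fit_predict, k=10, seed=0):
    """Return (mean, population std) of per-fold accuracy.

    fit_predict(train_idx, test_idx) is a caller-supplied function that
    trains on train_idx and returns one predicted label per test index.
    """
    folds = stratified_folds(labels, k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        preds = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[t] for p, t in zip(preds, test_idx))
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Toy usage: a constant predictor on a balanced two-class problem.
toy_labels = [0] * 10 + [1] * 10
mean_acc, std_acc = cross_validate(
    toy_labels, lambda train, test: [0] * len(test), k=5, seed=1
)
```

Reporting the spread across folds alongside the mean is what makes two methods comparable on the same split, rather than on splits each paper picks for itself.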

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption might reduce the spread of incomparable results across different studies.
The collection could become a starting point for creating additional standardized test suites in related graph tasks.

Load-bearing premise

The main obstacle to progress in graph learning is the lack of meaningful benchmark datasets and standardized evaluation procedures, so releasing this collection will reduce that obstacle.

What would settle it

Papers in the area continue to rely on non-overlapping datasets and differing evaluation protocols without adopting the TUDataset resources.

Original abstract

Recently, there has been an increasing interest in (supervised) learning with graph data, especially using graph neural networks. However, the development of meaningful benchmark datasets and standardized evaluation procedures is lagging, consequently hindering advancements in this area. To address this, we introduce the TUDataset for graph classification and regression. The collection consists of over 120 datasets of varying sizes from a wide range of applications. We provide Python-based data loaders, kernel and graph neural network baseline implementations, and evaluation tools. Here, we give an overview of the datasets, standardized evaluation procedures, and provide baseline experiments. All datasets are available at www.graphlearning.io. The experiments are fully reproducible from the code available at www.github.com/chrsmrrs/tudataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a straightforward resource paper releasing over 120 graph datasets plus loaders and baseline code to help standardize comparisons.

The core contribution is the TUDataset collection itself: more than 120 graph datasets spanning different sizes and application areas, bundled with Python data loaders, kernel and GNN baseline implementations, and evaluation tools. The authors also point to public hosting at graphlearning.io and fully reproducible code on GitHub. That package of data plus tooling is the main deliverable, and it directly targets the problem of scattered private datasets in graph classification and regression work.

The scale and the decision to include ready baselines are the parts that could actually move practice if the collection holds up under use. The abstract does not supply selection criteria or validation steps, so we cannot yet judge how the datasets were filtered or whether duplicates or low-quality entries slipped in. The motivation about lagging benchmarks is stated plainly but receives no supporting counts or examples in the text available.

This paper is for researchers who run graph neural network or kernel experiments and want a single place to pull consistent test sets rather than hunting down individual sources. A reader who needs quick, reproducible baselines on a broad set of graphs will get concrete value from the loaders and code even before any deeper analysis.

The work shows clear intent to support community standards rather than advance a new method, which is fine for a resource paper. I would send it to peer review because a well-curated public collection at this size can reduce redundant data work and improve comparability across papers, provided the full version documents the curation process and reports the baseline numbers clearly.

Referee Report

1 major / 0 minor

Summary. The paper claims to introduce the TUDataset collection for graph classification and regression, consisting of over 120 datasets from various applications. It provides Python-based data loaders, kernel and GNN baseline implementations, and evaluation tools. The abstract states that an overview of the datasets, standardized evaluation procedures, and baseline experiments are given, with all datasets available at www.graphlearning.io and experiments reproducible from code at the provided GitHub repository.

Significance. If the collection is comprehensive and the tools effective, this resource could help standardize benchmarks in graph learning, facilitating advancements by addressing the lack of meaningful benchmarks. The explicit commitment to reproducibility through available code is a strength that enhances the potential impact.

major comments (1)

[Abstract] The abstract asserts the existence and availability of the collection and tools but supplies no details on dataset selection criteria, validation, or baseline performance numbers; soundness of the central claim cannot be verified beyond the statement of availability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and address the major comment below.

Referee: [Abstract] The abstract asserts the existence and availability of the collection and tools but supplies no details on dataset selection criteria, validation, or baseline performance numbers; soundness of the central claim cannot be verified beyond the statement of availability.

Authors: Abstracts are intentionally concise and serve to summarize the paper's contributions at a high level. The manuscript body provides the overview of the datasets (including selection criteria and characteristics from various applications), standardized evaluation procedures, and baseline experiments with performance numbers. The central claim of introducing a reproducible collection is substantiated by the public availability of all datasets at www.graphlearning.io and the code at the GitHub repository, enabling direct verification and use by the community. We maintain that the abstract appropriately highlights these elements without requiring the level of detail suggested.

Revision: no

Circularity Check

0 steps flagged

No significant circularity; resource announcement only

The paper is a dataset collection announcement containing no derivations, equations, predictions, fitted parameters, or load-bearing technical claims. The abstract describes introducing TUDataset with loaders and baselines but presents no analytic chain that could reduce to self-definition, fitted inputs, or self-citations. This is a standard resource paper whose claims are self-contained and externally verifiable by dataset availability.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required; the contribution is a curated resource rather than a derivation.

pith-pipeline@v0.9.0 · 5643 in / 1026 out tokens · 42756 ms · 2026-05-25T07:49:20.880476+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?
cs.CR 2026-05 accept novelty 8.0

GraphIP-Bench shows stealing GNNs is easy at moderate query budgets, most defenses fail to block or reliably trace extraction, and watermarks lose verification power on surrogates while heterophilic graphs are harder ...

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals
cs.LG 2025-06 conditional novelty 8.0

Authors release HSG-12M, a dataset of 16.7 million spatial multigraphs generated from non-Hermitian crystal energy spectra via the Poly2Graph pipeline, along with initial GNN benchmarks.

Beyond Oversquashing: Understanding Signal Propagation in GNNs Via Observables
cs.LG 2026-05 unverdicted novelty 7.0

Quantum-inspired observables reveal poor signal routing in standard spectral GNNs and motivate Schrödinger GNNs with superior propagation capacity.

Higher-order Persistence Diagrams
cs.CG 2026-05 unverdicted novelty 7.0

Higher-order persistence diagrams are defined recursively via interval containments, and their aggregations can be evaluated in nearly linear time using zeta transforms instead of explicit pair enumeration.

CTQWformer: A CTQW-based Transformer for Graph Classification
cs.LG 2026-05 unverdicted novelty 7.0

CTQWformer fuses continuous-time quantum walks into a graph transformer and recurrent module to outperform standard GNNs and graph kernels on classification benchmarks.

Concept Graph Convolutions: Message Passing in the Concept Space
cs.LG 2026-04 unverdicted novelty 7.0

Concept Graph Convolutions perform message passing on node concepts to increase interpretability of graph neural networks without losing task performance.

R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
cs.CV 2026-04 accept novelty 7.0

R2G is a multi-view circuit graph benchmark showing that representation choice affects GNN accuracy more than model architecture, with node-centric views and deeper decoders performing best.

Efficient and Accurate Graph Classification with Hyperdimensional Computing on FPGA
cs.AR 2025-12 conditional novelty 7.0

HyperX is the first end-to-end FPGA accelerator for Nyström-based HDC graph classification, delivering 6.85× speedup and 169× energy efficiency over CPU baselines plus 3.4% average accuracy gain on TUDataset benchmarks.

Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization
cs.LG 2025-08 unverdicted novelty 7.0

Logic-based Weisfeiler-Leman variants enable graph-to-table conversion for classification that matches GNN and graph transformer accuracy while running 5-20x faster without GPUs.

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals
cs.LG 2025-06 unverdicted novelty 7.0

HSG-12M is a large dataset of spatial multigraphs derived from non-Hermitian crystal energy spectra via the Poly2Graph pipeline, positioned as the first large-scale benchmark of this graph type.

A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants
cs.LG 2025-05 unverdicted novelty 7.0

RelSC is a new graph regression benchmark from program graphs with execution time labels, released in homogeneous (RelSC-H) and multi-relational (RelSC-M) variants to study representation effects.

Estimating Subgraph Importance with Structural Prior Domain Knowledge
cs.LG 2026-05 unverdicted novelty 6.0

A label-free Group Lasso method estimates important subgraphs in pretrained GNNs by incorporating domain structural knowledge.

Quantum Injection Pathways for Implicit Graph Neural Networks
quant-ph 2026-05 unverdicted novelty 6.0

Independent quantum signal injection into graph DEQs yields higher test accuracy and fewer solver iterations than state-dependent or backbone-dependent injection and classical equilibrium models on NCI1, PROTEINS, and...

GraphNetz: Statistical Benchmarking of Graph Neural Networks with Paired Tests and Rank Aggregation
cs.CE 2026-05 unverdicted novelty 6.0

GraphNetz supplies an automated statistical pipeline for GNN benchmarking that includes per-cell confidence intervals, paired tests with multiple-comparison correction, and critical-difference diagrams across tasks an...

Subgraph Concept Networks: Concept Levels in Graph Classification
cs.LG 2026-04 unverdicted novelty 6.0

Subgraph Concept Network is a new GNN architecture that distills meaningful concepts at node, subgraph, and graph levels via soft clustering to improve explainability while maintaining competitive accuracy.

Learning from Historical Activations in Graph Neural Networks
cs.LG 2026-01 unverdicted novelty 6.0

HISTOGRAPH applies unified layer-wise attention followed by node-wise attention over historical GNN activations to improve graph classification, especially in deep models.

Adaptive Canonicalization with Application to Invariant Anisotropic Geometric Networks
cs.LG 2025-09 unverdicted novelty 6.0

Adaptive canonicalization selects input canonical forms by maximizing network predictive confidence to yield continuous symmetry-preserving models with universal approximation for equivariant geometric networks.

How Embeddings Shape Graph Neural Networks: Classical vs Quantum-Oriented Node Representations
cs.LG 2026-04 unverdicted novelty 5.0

Quantum-oriented embeddings deliver consistent gains on structure-driven graph datasets while classical baselines perform adequately on attribute-limited social graphs, under identical training pipelines across five T...

GP2F: Cross-Domain Graph Prompting with Adaptive Fusion of Pre-trained Graph Neural Networks
cs.LG 2026-02 unverdicted novelty 5.0

GP2F is a dual-branch graph prompting framework that fuses frozen pre-trained knowledge with task-specific adaptation to reduce estimation error and outperform baselines in cross-domain few-shot node and graph classification.

OpenGLT: A Comprehensive Benchmark of Graph Neural Networks for Graph-Level Tasks
cs.LG 2025-01 unverdicted novelty 5.0

OpenGLT benchmark finds no single GNN architecture dominates graph-level tasks, with subgraph-based models strongest in expressiveness, graph learning and SSL models in robustness, node and pooling models in efficienc...

Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
cs.LG 2026-05 conditional novelty 4.0

The paper claims current graph condensation approaches are flawed due to full-dataset training requirements, high overhead, poor generalization, and misleading evaluation metrics, calling for a reset toward lightweigh...

Fine-Grained Graph Generation through Latent Mixture Scheduling
cs.AI 2026-05 unverdicted novelty 4.0

A novel CVAE with mixture scheduling achieves fine-grained structural control in graph generation, showing high quality and controllability on five datasets.

Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
cs.LG 2026-05 unverdicted novelty 3.0

Graph condensation methods must move beyond full-dataset training and model dependence toward lightweight, architecture-agnostic designs to achieve practical efficiency in GNNs.

Graph Rewiring in GNNs to Mitigate Over-Squashing and Over-Smoothing: A Survey
cs.LG 2024-11 unverdicted novelty 2.0

A survey compiling graph rewiring techniques for mitigating over-squashing and over-smoothing in GNNs.