A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring
Pith reviewed 2026-05-22 13:45 UTC · model grok-4.3
The pith
Weisfeiler-Leman color refinement mines class-discriminating motifs from any graph classification dataset to serve as proxy ground-truth explanations for GNNs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Weisfeiler-Leman color refinement performs efficient approximate subgraph matching to extract class-discriminating motifs, which then function as proxy ground-truth class explanations; these motifs are learnable by GNNs precisely because their discriminating power is aligned with the expressiveness limit of the WL algorithm.
What carries the argument
The Weisfeiler-Leman color refinement algorithm applied to approximate subgraph matching for extraction of class-discriminating motifs.
If this is right
- Any existing graph classification dataset can be turned into a ready XAI benchmark without requiring domain-expert curation of explanations.
- The resulting collection supplies a large, reproducible testbed for measuring how well graph explainers recover the motifs that actually drive predictions.
- Because the motifs respect WL expressiveness, the benchmarks remain appropriate for evaluating standard message-passing GNNs rather than more powerful architectures.
- The public release of fifteen molecular datasets and code for over two thousand additional ones removes the previous scarcity of systematic evaluation resources.
- Experimental comparisons of explainers gain statistical power when performed across many independently generated benchmarks instead of a handful of hand-crafted tasks.
Where Pith is reading between the lines
- The same motif-mining procedure could be adapted to regression or link-prediction tasks by replacing class labels with continuous targets or edge properties.
- Benchmarks produced this way may expose weaknesses in current explainers that remain hidden on the small synthetic sets used today.
- Developers of new GNN architectures could use the WL alignment test as a quick filter to decide whether a proposed model can even represent the explanations in these benchmarks.
- Extending the method to heterogeneous or temporal graphs would require only a suitable generalization of the color-refinement step.
Load-bearing premise
The motifs identified by Weisfeiler-Leman color refinement accurately reflect the actual decision rules that GNNs learn when solving the same classification task.
What would settle it
Train a GNN to high accuracy on one of the generated benchmarks and then show that the highest-ranked explanations returned by multiple popular graph explainers consistently fail to recover the WL-mined motif.
read the original abstract
Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a method to systematically generate graph XAI benchmarks from generic graph classification datasets by leveraging the Weisfeiler-Leman color refinement algorithm for approximate subgraph matching and mining class-discriminating motifs. These motifs are positioned as proxy ground-truth explanations for GNN predictions, with the claim that their alignment to WL expressiveness ensures they are learnable by GNNs. The method is used to construct the OpenGraphXAI suite of 15 datasets derived from real-world molecular classification tasks, accompanied by public code for generating over 2000 additional benchmarks, and illustrated via a use case evaluating several popular graph explainers.
Significance. If the proxy motifs reliably reflect GNN decision processes, the work would provide a scalable, automated alternative to simplistic synthetic or expert-curated benchmarks, enabling more rigorous, reproducible, and statistically powered evaluations of graph XAI methods. The public release of the benchmark suite and generation codebase is a clear strength for community adoption and reproducibility.
major comments (2)
- [Method and Use Case sections] The central claim that WL-derived motifs serve as valid proxy ground-truth explanations because 'their discriminating power aligns with WL expressiveness' (abstract and method overview) does not address whether a trained GNN (e.g., GIN or GCN) on the resulting dataset actually bases its predictions on the mined motifs rather than other label-correlated substructures that are also WL-expressible. This assumption is load-bearing for the proxy validity but receives only illustrative treatment in the use case.
- [Use Case / Experiments] No quantitative verification is described (e.g., via motif ablation, feature importance alignment, or comparison against GNN attention/gradients) to confirm that the WL-mined motifs are the features actually used by the GNN rather than co-occurring alternatives. This weakens the claim that the benchmarks test explainer fidelity to the model's decision process.
minor comments (2)
- [Method] Clarify the exact WL iteration depth and subgraph matching threshold used in the mining procedure, as these choices directly affect motif size and fidelity.
- [Benchmark Suite] The abstract states the suite contains 15 datasets; provide a table listing the source molecular datasets, number of graphs, and motif statistics for each.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on the proxy validity of the WL-derived motifs and indicating where revisions have been made to strengthen the presentation.
read point-by-point responses
-
Referee: [Method and Use Case sections] The central claim that WL-derived motifs serve as valid proxy ground-truth explanations because 'their discriminating power aligns with WL expressiveness' (abstract and method overview) does not address whether a trained GNN (e.g., GIN or GCN) on the resulting dataset actually bases its predictions on the mined motifs rather than other label-correlated substructures that are also WL-expressible. This assumption is load-bearing for the proxy validity but receives only illustrative treatment in the use case.
Authors: We agree that the proxy validity rests on an assumption about GNN behavior. The method identifies class-discriminating motifs through WL color refinement, which guarantees both their discriminative power and their expressibility within the WL framework that bounds standard GNNs such as GIN. While a trained GNN could theoretically rely on other co-occurring WL-expressible features, the mined motifs are the systematically extracted primary patterns that distinguish classes in the original data. In the revised manuscript we have added an explicit discussion paragraph in the Method section clarifying this assumption, noting that the benchmarks evaluate explainer fidelity relative to these salient WL-discriminating motifs under the standard premise that high-accuracy GNNs learn such features. We have also expanded the use-case discussion to better connect the mined motifs to the observed explainer outputs. revision: yes
-
Referee: [Use Case / Experiments] No quantitative verification is described (e.g., via motif ablation, feature importance alignment, or comparison against GNN attention/gradients) to confirm that the WL-mined motifs are the features actually used by the GNN rather than co-occurring alternatives. This weakens the claim that the benchmarks test explainer fidelity to the model's decision process.
Authors: We acknowledge that the original use case was primarily illustrative. To address the request for quantitative verification, the revised manuscript now includes a new subsection in the Experiments section reporting motif-ablation results on three representative datasets from the suite. Removing the mined motifs leads to measurable drops in GNN accuracy, while ablation of randomly selected WL-expressible substructures does not, supporting that the motifs are among the features driving predictions. We have also added a comparison of explainer rankings against gradient-based feature importance scores on the same datasets. These additions provide the requested quantitative grounding without altering the core method. revision: yes
Circularity Check
No significant circularity in the WL-based benchmark construction
full rationale
The paper presents an explicit construction method that applies the standard Weisfeiler-Leman color refinement algorithm to existing graph classification datasets in order to mine class-discriminating motifs. These motifs are positioned as proxy ground-truth explanations precisely because their discriminating power is defined by WL, and WL-expressiveness is an independently established property of certain GNN architectures. No step reduces a claimed prediction or uniqueness result to a fitted parameter, a self-referential definition, or a load-bearing self-citation chain; the derivation remains self-contained against external graph-theoretic benchmarks and does not rename or smuggle in prior author-specific ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Weisfeiler-Leman color refinement performs efficient approximate subgraph matching for class-discriminating motifs
invented entities (1)
-
WL-derived class-discriminating motifs as proxy ground-truth explanations
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanbare_distinguishability_of_absolute_floor echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the discriminating motifs are extracted from the set C = union over G in D of C(<=L)_G ... the explanation motif M_G is a subgraph whose nodes are those of the union graph comprising all rooted subgraphs in G corresponding to c-bar-colored nodes
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Agarwal, C., Queen, O., Lakkaraju, H., Zitnik, M. (2023). Evaluating explainability for graph neural networks.Scientific Data,10(1), 144,
work page 2023
-
[2]
Rahouti, M., . . . De Albuquerque, V.H.C. (2024). Explainable artificial intel- ligence for drug discovery and development: A comprehensive survey.IEEE Access,12, 35796-35812, https://doi.org/10.1109/ACCESS.2024.3373195
-
[3]
Amara, K., Ying, Z., Zhang, Z., Han, Z., Zhao, Y., Shan, Y., . . . Zhang, C. (2022, 09–12 Dec). GraphFramEx: Towards systematic evaluation of explainability methods for graph neural networks. B. Rieck & R. Pascanu (Eds.),Proceedings of the first learning on graphs conference(Vol. 198, pp. 44:1–44:23). PMLR
work page 2022
- [4]
-
[5]
Bacciu, D., Errica, F., Micheli, A., Podda, M. (2020). A gentle introduction to deep learning for graphs.Neural Networks,129, 203-221, https://doi.org/10.1016/ j.neunet.2020.06.006
work page 2020
-
[6]
Explainability Techniques for Graph Convolutional Networks
Baldassarre, F., & Azizpour, H. (2019). Explainability techniques for graph convolutional networks.Icml 2019 workshop “learning and reasoning with graph- structured representations”.Retrieved from https://arxiv.org/abs/1905.13686 Barcel´ o, P., Kostylev, E.V., Monet, M., P´ erez, J., Reutter, J., Silva, J.P. (2020). The Logical Expressiveness of Graph Neu...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[7]
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L. (2009). Imagenet: A large- scale hierarchical image database.2009 ieee conference on computer vision and pattern recognition(p. 248-255). D’Inverno, G.A., Bianchini, M., Sampoli, M.L., Scarselli, F. (2024). On the approxima- tion capability of GNNs in node classification/regression tasks.Soft ...
-
[8]
Dipalma, A., Fontanesi, M., Micheli, A., Milazzo, P., Podda, M. (2025). Sensitivity analysis on protein-protein interaction networks through deep graph networks. BMC Bioinformatics,26(124), , https://doi.org/10.1186/s12859-025-06140-1
-
[9]
Faber, L., K. Moghaddam, A., Wattenhofer, R. (2021). When comparing to ground truth is wrong: On evaluating gnn explanation methods.Proceedings of the 27th acm sigkdd conference on knowledge discovery & data mining(pp. 332–341)
work page 2021
-
[10]
Fey, M., & Lenssen, J.E. (2019). Fast graph representation learning with PyTorch Geometric.Iclr workshop on representation learning on graphs and manifolds. Retrieved from https://arxiv.org/abs/1903.02428
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[11]
Fontanesi, M., Micheli, A., Podda, M. (2024b). XAI and bias of deep graph networks. Proceedings of the 32nd european symposium on artificial neural networks, com- putational intelligence and machine learning intelligence (esann 2024)(pp. 41–46). Ciaco - i6doc.com
work page 2024
-
[12]
Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E. (2017, 06– 11 Aug). Neural message passing for quantum chemistry. D. Precup & Y.W. Teh (Eds.),Proceedings of the 34th international conference on machine learning(Vol. 70, pp. 1263–1272). PMLR. Retrieved from https://proceedings.mlr.press/v70/gilmer17a.html
work page 2017
-
[13]
European Union regulations on algorithmic decision-making and a "right to explanation"
Goodman, B., & Flaxman, S. (2016). EU regulations on algorithmic decision-making and a “right to explanation”.Icml workshop on human interpretability in machine learning (whi 2016).New York, NY. Retrieved from http://arxiv. org/abs/1606.08813
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[14]
Graziani, C., Drucks, T., Jogl, F., Bianchini, M., Scarselli, F., G¨ artner, T. (2024). The Expressive Power of Path-Based Graph Neural Networks. R. Salakhutdinov et al. (Eds.),Proceedings of the 41st international con- ference on machine learning(Vol. 235, pp. 16226–16249). Retrieved from https://proceedings.mlr.press/v235/graziani24a.html
work page 2024
-
[15]
A survey on explain- ability of graph neural networks
Kakkad, J., Jannu, J., Sharma, K., Aggarwal, C., Medya, S. (2023).A survey on explainability of graph neural networks.Retrieved from https://arxiv.org/abs/2306.01958 24
-
[16]
Kazius, J., McGuire, R., Bursi, R. (2005). Derivation and validation of toxicophores for mutagenicity prediction.Journal of Medicinal Chemistry,48(1), 312-320, https://doi.org/10.1021/jm040835a
-
[17]
Kiefer, S., & Mckay, B.D. (2020). The Iteration Number of Colour Refinement. Proceedings of the 47th international colloquium on automata, languages, and programming(pp. 1–19)
work page 2020
-
[18]
Kingma, D.P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd international conference on learning representations
work page 2015
-
[19]
Kriege, N.M. (2022). Weisfeiler and leman go walking: Random walk ker- nels revisited. A.H. Oh, A. Agarwal, D. Belgrave, & K. Cho (Eds.), Advances in neural information processing systems.Retrieved from https://openreview.net/forum?id=Inj9ed0mzQb
work page 2022
-
[20]
Longa, A., Azzolin, S., Santin, G., Cencetti, G., Li` o, P., Lepri, B., Passerini, A. (2025). Explaining the explainers in graph neural networks: a comparative study.ACM Computing Surveys,57(5), 1–37, https://doi.org/10.1145/3696444
-
[21]
Luo, D., Cheng, W., Xu, D., Yu, W., Zong, B., Chen, H., Zhang, X. (2020). Parameter- ized explainer for graph neural network. H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.),Advances in neural information processing systems (Vol. 33, pp. 19620–19631). Curran Associates, Inc
work page 2020
- [22]
-
[23]
Morris, C., Kriege, N.M., Bause, F., Kersting, K., Mutzel, P., Neumann, M. (2020). TUDataset: A collection of benchmark datasets for learning with graphs. Icml 2020 workshop on graph representation learning and beyond (grl+ 2020). Retrieved from www.graphlearning.io
work page 2020
-
[24]
Morris, C., Lipman, Y., Maron, H., Rieck, B., Kriege, N.M., Grohe, M., . . . Borgwardt, K. (2023). Weisfeiler and Leman go Machine Learning: The Story so far.Journal of Machine Learning Research,24(333), 1–59,
work page 2023
-
[25]
Morris, C., Ritzert, M., Fey, M., Hamilton, W.L., Lenssen, J.E., Rattan, G., Grohe, M. (2019). Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks. Proceedings of the aaai conference on artificial intelligence(Vol. 33, pp. 4602– 4609). 25
work page 2019
-
[26]
Oneto, L., Navarin, N., Biggio, B., Errica, F., Micheli, A., Scarselli, F., . . . others (2022). Towards learning trustworthily, automatically, and with guarantees on graphs: An overview.Neurocomputing,493, 217–243,
work page 2022
-
[27]
Pope, P.E., Kolouri, S., Rostami, M., Martin, C.E., Hoffmann, H. (2019). Explainabil- ity methods for graph convolutional neural networks.2019 ieee/cvf conference on computer vision and pattern recognition (cvpr)(p. 10764-10773)
work page 2019
-
[28]
Bagel: A benchmark for assessing graph neural network explanations
Rathee, M., Funke, T., Anand, A., Khosla, M. (2022).BAGEL: A bench- mark for assessing graph neural network explanations.Retrieved from https://arxiv.org/abs/2206.13983
-
[29]
Riesen, K., & Bunke, H. (2008). IAM graph database repository for graph based pattern recognition and machine learning. N. da Vitoria Lobo et al. (Eds.), Structural, syntactic, and statistical pattern recognition(pp. 287–297). Berlin, Heidelberg: Springer Berlin Heidelberg
work page 2008
-
[30]
Sanchez-Lengeling, B., Wei, J., Lee, B., Reif, E., Wang, P., Qian, W., . . . Wiltschko, A. (2020). Evaluating attribution for graph neural networks. H. Larochelle, M. Ran- zato, R. Hadsell, M. Balcan, & H. Lin (Eds.),Advances in neural information processing systems(Vol. 33, pp. 5898–5910). Curran Associates, Inc
work page 2020
-
[31]
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009a). Com- putational capabilities of graph neural networks.IEEE Transactions on Neural Networks,20(1), 81–102, https://doi.org/10.1109/TNN.2008.2005141
-
[32]
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009b). The graph neural network model.IEEE Transactions on Neural Networks,20(1), 61-80, https://doi.org/10.1109/TNN.2008.2005605
-
[33]
Simonyan, K., Vedaldi, A., Zisserman, A. (2014). Deep inside convolutional net- works: Visualising image classification models and saliency maps.Workshop at the 2nd international conference on learning representations.Retrieved from https://openreview.net/forum?id=cO4ycnpqxKcS9
work page 2014
-
[34]
Sterling, T., & Irwin, J.J. (2015). Zinc 15 – ligand discovery for everyone.Journal of Chemical Information and Modeling,55(11), 2324-2337, https://doi.org/ 10.1021/acs.jcim.5b00559
-
[35]
Sundararajan, M., Taly, A., Yan, Q. (2017, 06–11 Aug). Axiomatic attribution for deep networks. D. Precup & Y.W. Teh (Eds.),Proceedings of the 34th international conference on machine learning(Vol. 70, pp. 3319–3328). PMLR. 26
work page 2017
-
[36]
Varbella, A., Amara, K., Gjorgiev, B., El-Assady, M., Sansavini, G. (2024). Power- graph: A power grid benchmark dataset for graph neural networks.Advances in Neural Information Processing Systems,37, 110784–110804,
work page 2024
-
[37]
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S. (2018, Novem- ber). GLUE: A multi-task benchmark and analysis platform for natural language understanding. T. Linzen, G. Chrupa la, & A. Alishahi (Eds.),Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural net- works for NLP(pp. 353–355). Brussels, Belgium: A...
work page 2018
-
[38]
Wang, X., He, X., Wang, M., Feng, F., Chua, T.-S. (2019, July). Neural graph collaborative filtering.Proceedings of the 42nd international acm sigir confer- ence on research and development in information retrieval(p. 165–174). ACM. Retrieved from http://dx.doi.org/10.1145/3331184.3331267
-
[39]
Weisfeiler, B.Y., & Leman, A.A. (1968). A reduction of a graph to a canonical form and an algebra arising during this reduction.Nauchno-Technicheskaya Informatsia, 2(9), 12–16,
work page 1968
-
[40]
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S. (2021). A Comprehensive Survey on Graph Neural Networks.IEEE Transactions on Neural Networks and Learning Systems,32(1), 4–24, https://doi.org/10.1109/TNNLS.2020.2978386
-
[41]
Xu, K., Hu, W., Leskovec, J., Jegelka, S. (2019). How powerful are graph neu- ral networks?Proceedings of the 7th international conference on learning representations
work page 2019
-
[42]
Ying, Z., Bourgeois, D., You, J., Zitnik, M., Leskovec, J. (2019). GNNExplainer: Generating explanations for graph neural networks. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, & R. Garnett (Eds.),Advances in neural information processing systems(Vol. 32). Curran Associates, Inc
work page 2019
-
[43]
Yuan, H., Yu, H., Gui, S., Ji, S. (2023, may). Explainability in graph neural networks: A taxonomic survey.IEEE Transactions on Pattern Analysis and Machine Intelligence,45(05), 5782-5799, https://doi.org/10.1109/TPAMI.2022 .3204236
- [44]
-
[45]
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A. (2016). Learning deep features for discriminative localization.Proceedings of the ieee conference on computer vision and pattern recognition(pp. 2921–2929)
work page 2016
-
[46]
Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y. (2019). Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, & R. Garnett (Eds.),Advances in neural information processing systems(Vol. 32). Curran Associates, Inc
work page 2019
-
[47]
Zhu, W., Wen, T., Song, G., Wang, L., Zheng, B. (2023). On Structural Expressive Power of Graph Transformers.Proceedings of the 29th acm sigkdd conference on knowledge discovery and data mining(pp. 3628–3637). New York, NY, USA: ACM. 28
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.