Recognition: no theorem link
Cluster Attention for Graph Machine Learning
Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3
The pith
Augmenting message passing networks or graph transformers with cluster attention improves performance on a wide range of graph datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We divide graph nodes into clusters using off-the-shelf community detection algorithms and let each node attend to all other nodes inside each of its clusters. This cluster attention supplies large receptive fields while retaining strong graph-structure-based inductive biases. When it augments either message passing neural networks or graph transformers, the combined models achieve significantly higher performance on a wide range of graph datasets, including the real-world applications collected in the GraphLand benchmark.
What carries the argument
Cluster attention (CLATT), which partitions nodes via community detection and restricts attention to intra-cluster pairs to expand reach while preserving topology biases.
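In standard attention notation, this amounts to restricting the softmax support to cluster members. A minimal formalization in our notation, not the paper's (here $C(i)$ is the set of nodes sharing a cluster with node $i$, $h_i$ its feature vector, and $W_Q$, $W_K$, $W_V$ learned projections of width $d$):

$$h_i' = \sum_{j \in C(i)} \alpha_{ij}\, W_V h_j, \qquad \alpha_{ij} = \frac{\exp\big((W_Q h_i)^\top (W_K h_j)/\sqrt{d}\big)}{\sum_{k \in C(i)} \exp\big((W_Q h_i)^\top (W_K h_k)/\sqrt{d}\big)}.$$

Global attention is the special case where $C(i)$ is the whole node set; 1-hop message passing corresponds to $C(i)$ being the immediate neighborhood.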
Load-bearing premise
Community detection algorithms will produce clusters that capture useful graph structure and thereby supply effective inductive biases.
What would settle it
An independent replication that applies the proposed cluster attention augmentation to the same models and GraphLand datasets and finds no improvement, or a drop in accuracy, relative to the unaugmented baselines.
original abstract
Message Passing Neural Networks have recently become the most popular approach to graph machine learning tasks; however, their receptive field is limited by the number of message passing layers. To increase the receptive field, Graph Transformers with global attention have been proposed; however, global attention does not take into account the graph topology and thus lacks graph-structure-based inductive biases, which are typically very important for graph machine learning tasks. In this work, we propose an alternative approach: cluster attention (CLATT). We divide graph nodes into clusters with off-the-shelf graph community detection algorithms and let each node attend to all other nodes in each cluster. CLATT provides large receptive fields while still having strong graph-structure-based inductive biases. We show that augmenting Message Passing Neural Networks or Graph Transformers with CLATT significantly improves their performance on a wide range of graph datasets including datasets from the recently introduced GraphLand benchmark representing real-world applications of graph machine learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Cluster Attention (CLATT) as an augmentation to Message Passing Neural Networks or Graph Transformers. Nodes are partitioned into clusters via off-the-shelf community detection algorithms, after which each node attends to all others within its cluster(s). The authors claim this simultaneously yields large receptive fields and preserves graph-structure inductive biases, resulting in significant performance gains on diverse graph datasets including the GraphLand benchmark.
Significance. If the results hold under rigorous controls, CLATT offers a practical compromise between the depth-limited receptive fields of MPNNs and the topology-agnostic nature of global attention in Graph Transformers. The use of standard community detection algorithms is a strength for simplicity and reproducibility. The approach could be impactful for real-world graph tasks if cluster-scale analysis confirms the receptive-field benefit.
major comments (3)
- [Method / Experiments] The central claim that CLATT delivers large receptive fields while retaining inductive biases depends on cluster sizes being substantially larger than standard 1-hop neighborhoods. The manuscript provides no statistics (mean, median, or distribution) on cluster sizes produced by the chosen community detection algorithms on any evaluated dataset; without this, the receptive-field argument cannot be evaluated and any gains cannot be attributed to the stated mechanism.
- [Experiments] The experimental section must specify the exact community detection algorithm(s), all hyperparameters, and the handling of overlapping or multi-cluster assignments, as these are free parameters listed in the paper's own description. Results should include error bars, statistical significance tests, and ablation studies isolating CLATT from the clustering choice itself.
- [Experiments] Performance tables (presumably in §4 or §5) report improvements but lack explicit comparison to other receptive-field expansion techniques (e.g., higher-order message passing or diffusion-based methods) that also aim to enlarge neighborhoods while respecting topology; this weakens the claim that CLATT is a distinctly advantageous alternative.
minor comments (2)
- [Method] Add a formal equation or pseudocode block defining the CLATT attention computation, including how intra-cluster attention is integrated with the base MPNN or Transformer layers (one possible formalization is sketched after this list).
- [Experiments] Ensure all datasets, community detection parameters, and code are released to support reproducibility of the reported gains.
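Pending that addition, a minimal sketch of one possible formalization, assuming a non-overlapping partition, dense attention, and a sum-then-LayerNorm combination with the base update; the class names and the linear stand-in for the MPNN aggregation are our illustrative choices, not the paper's:

```python
# Sketch only: one plausible CLATT-augmented layer, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterAttention(nn.Module):
    """Scaled dot-product attention restricted to intra-cluster node pairs."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # h: [n, dim] node features; cluster_ids: [n] one cluster index per node
        # (assumes a non-overlapping partition).
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = q @ k.t() / h.size(-1) ** 0.5            # [n, n]
        same_cluster = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
        scores = scores.masked_fill(~same_cluster, float("-inf"))
        return F.softmax(scores, dim=-1) @ v              # attend within the cluster only

class CLATTAugmentedLayer(nn.Module):
    """Base local update plus a parallel cluster-attention branch, combined
    by summation and LayerNorm (one plausible choice, not the paper's)."""
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Linear(dim, dim)   # stand-in for an MPNN aggregation
        self.clatt = ClusterAttention(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, adj, cluster_ids):
        # adj: [n, n] normalized dense adjacency (a sparse op in practice).
        local = F.relu(self.local(adj @ h))
        return self.norm(h + local + self.clatt(h, cluster_ids))
```

Note that the dense [n, n] mask is O(n²); at scale one would compute attention blockwise per cluster instead.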
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to provide the requested statistics, specifications, controls, and comparisons, thereby strengthening the manuscript's claims regarding receptive fields and inductive biases.
point-by-point responses
Point 1.
Referee: [Method / Experiments] The central claim that CLATT delivers large receptive fields while retaining inductive biases depends on cluster sizes being substantially larger than standard 1-hop neighborhoods. The manuscript provides no statistics (mean, median, or distribution) on cluster sizes produced by the chosen community detection algorithms on any evaluated dataset; without this, the receptive-field argument cannot be evaluated and any gains cannot be attributed to the stated mechanism.
Authors: We agree that cluster-size statistics are necessary to rigorously support the receptive-field mechanism. In the revised manuscript we will add a dedicated subsection (with accompanying table) reporting, for every dataset and algorithm, the mean, median, standard deviation, and full distribution of cluster sizes. These numbers will be contrasted with the average 1-hop neighborhood sizes to demonstrate that clusters are substantially larger, directly linking the observed gains to the intended mechanism. Revision: yes.
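As an illustration of the promised comparison, a minimal sketch using networkx's Louvain implementation as a stand-in for the paper's (unspecified) algorithm, with a toy graph as a placeholder dataset:

```python
# Sketch: cluster sizes vs. 1-hop neighborhood sizes under Louvain clustering.
import statistics
import networkx as nx

G = nx.karate_club_graph()  # placeholder; swap in the actual benchmark graph

clusters = nx.community.louvain_communities(G, seed=0)
cluster_sizes = [len(c) for c in clusters]
hop1_sizes = [G.degree(v) + 1 for v in G]  # each node plus its 1-hop neighbors

print(f"clusters: {len(clusters)}")
print(f"cluster size  mean={statistics.mean(cluster_sizes):.1f}  "
      f"median={statistics.median(cluster_sizes)}")
print(f"1-hop size    mean={statistics.mean(hop1_sizes):.1f}  "
      f"median={statistics.median(hop1_sizes)}")
```

The receptive-field argument requires the first distribution to dominate the second on the actual benchmark graphs.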
Point 2.
Referee: [Experiments] The experimental section must specify the exact community detection algorithm(s), all hyperparameters, and the handling of overlapping or multi-cluster assignments, as these are free parameters listed in the paper's own description. Results should include error bars, statistical significance tests, and ablation studies isolating CLATT from the clustering choice itself.
Authors: We will expand the experimental section to name the precise algorithms (e.g., Louvain, Leiden), list all hyperparameters with their chosen values, and explicitly describe the multi-cluster assignment policy used in our implementation. All reported results will include standard-error bars across random seeds, paired statistical significance tests against baselines, and new ablation experiments that replace community detection with random partitioning (while keeping the cluster-size distribution matched) to isolate the contribution of structure-aware clustering from the mere presence of larger attention scopes. Revision: yes.
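One way the size-matched random-partition control could be built (our construction, again assuming a non-overlapping partition):

```python
# Sketch: random partition whose cluster-size distribution matches Louvain's,
# isolating structure-awareness from the mere presence of larger attention scopes.
import random
import networkx as nx

G = nx.karate_club_graph()  # placeholder dataset
communities = nx.community.louvain_communities(G, seed=0)
sizes = [len(c) for c in communities]

nodes = list(G)
random.Random(0).shuffle(nodes)
random_partition, start = [], 0
for size in sizes:  # carve the shuffled node list into chunks of matching sizes
    random_partition.append(set(nodes[start:start + size]))
    start += size

assert sorted(map(len, random_partition)) == sorted(sizes)
```

If CLATT with this partition matches CLATT with Louvain clusters, the gains come from attention scope rather than community structure.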
Point 3.
Referee: [Experiments] Performance tables (presumably in §4 or §5) report improvements but lack explicit comparison to other receptive-field expansion techniques (e.g., higher-order message passing or diffusion-based methods) that also aim to enlarge neighborhoods while respecting topology; this weakens the claim that CLATT is a distinctly advantageous alternative.
Authors: We acknowledge the benefit of direct head-to-head comparisons. The revised version will include additional baselines, specifically k-hop message passing and diffusion-convolution variants, on the same GraphLand and other benchmarks. These results will be presented in the main tables (or in an appendix with a summary in the main text) to position CLATT relative to other topology-respecting receptive-field expansions and to quantify its practical trade-offs in accuracy versus computational cost. Revision: yes.
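For concreteness, one such diffusion-based comparator is a truncated personalized-PageRank diffusion (in the spirit of diffusion-based receptive-field expansion; which baselines the authors will actually run is not stated, so this is only an illustrative sketch):

```python
# Sketch: truncated personalized-PageRank diffusion S ≈ α · Σ_k (1-α)^k · T^k,
# which enlarges each node's receptive field while weighting by topology.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()  # placeholder dataset
A = nx.to_numpy_array(G)
T = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

alpha, K = 0.15, 10                   # teleport probability, truncation order
S = np.zeros_like(A)
P = np.eye(len(G))
for k in range(K + 1):
    S += alpha * (1 - alpha) ** k * P
    P = P @ T                         # advance to the next power of T

# Rows of S weight all nodes reachable within K hops.
print("avg nodes reached per row:", (S > 1e-4).sum(axis=1).mean())
```

Reporting CLATT against such baselines at matched compute would substantiate the claimed advantage.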
Circularity Check
No circularity: new architectural component with independent empirical claims
full rationale
The paper proposes CLATT as an architectural augmentation to MPNNs and Graph Transformers: nodes are partitioned via off-the-shelf community detection, then attention is performed within clusters. This is a design choice, not a mathematical derivation, equation, or prediction that reduces to fitted inputs or self-referential definitions. No equations appear in the abstract or description that equate a claimed result to its own construction. The central claim of performance gains is empirical and falsifiable on external benchmarks, with no load-bearing self-citations or uniqueness theorems invoked. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- choice and parameters of community detection algorithm
axioms (1)
- domain assumption: community detection algorithms produce clusters that meaningfully reflect graph topology and are suitable for attention-based learning.
invented entities (1)
- Cluster Attention (CLATT) mechanism (no independent evidence)
Reference graph
Works this paper leans on
- [1] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- [3] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
- [4] Dwivedi, V., Joshi, C., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
- [5] Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. AAAI 2021 Workshop on Deep Learning on Graphs: Methods and Applications (DLG-AAAI 2021), 2021.
- [6] Liu, Y., Qiu, P., Xing, Y., Liu, Y., Du, P., Hong, C., Zheng, J., Zheng, T., and He, T. Bridging academia and industry: A comprehensive benchmark for attributed graph clustering. arXiv preprint arXiv:2602.08519, 2026.
- [7] Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. GEMSEC: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 65–72. ACM, 2019.
- [8] Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhou, J., Ma, C., Yu, L., Gai, Y., et al. Deep Graph Library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
- [9] Boccaletti et al., 2014 (cited for the two standard definitions of clustering coefficients, the global and the average local; the paper reports both for the graphs used in its experiments).
- [10] Leskovec et al., 2007 (source of the data behind the amazon-ratings-5core dataset of Platonov et al., 2023b, originally called amazon-ratings; only the 5-core subset was used because many of the non-homophily models evaluated on it do not scale).
- [11] Leskovec et al., 2007 (same data; the 5-core reduction iteratively removes nodes of degree less than 5, producing a graph with many small, densely interconnected clusters of five or slightly more nodes).
- [12] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2015 (optimizer used for all runs: 1000 training steps, or 3000 on amazon-ratings-5core; GGT positional encodings, DeepWalk embeddings or Laplacian eigenvectors, have dimension 128).
- [13] Wang et al., 2019 (Deep Graph Library, see [8]; all experiments ran on an NVIDIA Tesla A100 80GB GPU, with code and reproduction instructions at https://github.com/OlegPlatonov/cluster-attention).