Recognition: no theorem link
Cluster Attention for Graph Machine Learning
Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3
The pith
Augmenting message passing networks or graph transformers with cluster attention improves performance on a wide range of graph datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We divide graph nodes into clusters using off-the-shelf community detection algorithms and let each node attend to all other nodes inside each of its clusters. This cluster attention supplies large receptive fields while retaining strong graph-structure-based inductive biases. When it augments either message passing neural networks or graph transformers, the combined models achieve significantly higher performance on a wide range of graph datasets, including the real-world applications collected in the GraphLand benchmark.
What carries the argument
Cluster attention (CLATT), which partitions nodes via community detection and restricts attention to intra-cluster pairs to expand reach while preserving topology biases.
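In standard attention notation, this amounts to restricting the softmax support to cluster members. A minimal formalization in our notation, not the paper's (here $C(i)$ is the set of nodes sharing a cluster with node $i$, $h_i$ its feature vector, and $W_Q$, $W_K$, $W_V$ learned projections of width $d$):

$$h_i' = \sum_{j \in C(i)} \alpha_{ij}\, W_V h_j, \qquad \alpha_{ij} = \frac{\exp\big((W_Q h_i)^\top (W_K h_j)/\sqrt{d}\big)}{\sum_{k \in C(i)} \exp\big((W_Q h_i)^\top (W_K h_k)/\sqrt{d}\big)}.$$

Global attention is the special case where $C(i)$ is the whole node set; 1-hop message passing corresponds to $C(i)$ being the immediate neighborhood.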
Load-bearing premise
Community detection algorithms will produce clusters that capture useful graph structure and thereby supply effective inductive biases.
What would settle it
An independent replication that applies the proposed cluster attention augmentation to the same models and GraphLand datasets and finds no improvement, or a drop in accuracy, relative to the unaugmented baselines.
original abstract
Message Passing Neural Networks have recently become the most popular approach to graph machine learning tasks; however, their receptive field is limited by the number of message passing layers. To increase the receptive field, Graph Transformers with global attention have been proposed; however, global attention does not take into account the graph topology and thus lacks graph-structure-based inductive biases, which are typically very important for graph machine learning tasks. In this work, we propose an alternative approach: cluster attention (CLATT). We divide graph nodes into clusters with off-the-shelf graph community detection algorithms and let each node attend to all other nodes in each cluster. CLATT provides large receptive fields while still having strong graph-structure-based inductive biases. We show that augmenting Message Passing Neural Networks or Graph Transformers with CLATT significantly improves their performance on a wide range of graph datasets including datasets from the recently introduced GraphLand benchmark representing real-world applications of graph machine learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Cluster Attention (CLATT) as an augmentation to Message Passing Neural Networks or Graph Transformers. Nodes are partitioned into clusters via off-the-shelf community detection algorithms, after which each node attends to all others within its cluster(s). The authors claim this simultaneously yields large receptive fields and preserves graph-structure inductive biases, resulting in significant performance gains on diverse graph datasets including the GraphLand benchmark.
Significance. If the results hold under rigorous controls, CLATT offers a practical compromise between the depth-limited receptive fields of MPNNs and the topology-agnostic nature of global attention in Graph Transformers. The use of standard community detection algorithms is a strength for simplicity and reproducibility. The approach could be impactful for real-world graph tasks if cluster-scale analysis confirms the receptive-field benefit.
major comments (3)
- [Method / Experiments] The central claim that CLATT delivers large receptive fields while retaining inductive biases depends on cluster sizes being substantially larger than standard 1-hop neighborhoods. The manuscript provides no statistics (mean, median, or distribution) on cluster sizes produced by the chosen community detection algorithms on any evaluated dataset; without this, the receptive-field argument cannot be evaluated and any gains cannot be attributed to the stated mechanism.
- [Experiments] The experimental section must specify the exact community detection algorithm(s), all hyperparameters, and the handling of overlapping or multi-cluster assignments, as these are free parameters listed in the paper's own description. Results should include error bars, statistical significance tests, and ablation studies isolating CLATT from the clustering choice itself.
- [Experiments] Performance tables (presumably in §4 or §5) report improvements but lack explicit comparison to other receptive-field expansion techniques (e.g., higher-order message passing or diffusion-based methods) that also aim to enlarge neighborhoods while respecting topology; this weakens the claim that CLATT is a distinctly advantageous alternative.
minor comments (2)
- [Method] Add a formal equation or pseudocode block defining the CLATT attention computation, including how intra-cluster attention is integrated with the base MPNN or Transformer layers (one possible formalization is sketched after this list).
- [Experiments] Ensure all datasets, community detection parameters, and code are released to support reproducibility of the reported gains.
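Pending that addition, a minimal sketch of one possible formalization, assuming a non-overlapping partition, dense attention, and a sum-then-LayerNorm combination with the base update; the class names and the linear stand-in for the MPNN aggregation are our illustrative choices, not the paper's:

```python
# Sketch only: one plausible CLATT-augmented layer, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterAttention(nn.Module):
    """Scaled dot-product attention restricted to intra-cluster node pairs."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # h: [n, dim] node features; cluster_ids: [n] one cluster index per node
        # (assumes a non-overlapping partition).
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = q @ k.t() / h.size(-1) ** 0.5            # [n, n]
        same_cluster = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
        scores = scores.masked_fill(~same_cluster, float("-inf"))
        return F.softmax(scores, dim=-1) @ v              # attend within the cluster only

class CLATTAugmentedLayer(nn.Module):
    """Base local update plus a parallel cluster-attention branch, combined
    by summation and LayerNorm (one plausible choice, not the paper's)."""
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Linear(dim, dim)   # stand-in for an MPNN aggregation
        self.clatt = ClusterAttention(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, adj, cluster_ids):
        # adj: [n, n] normalized dense adjacency (a sparse op in practice).
        local = F.relu(self.local(adj @ h))
        return self.norm(h + local + self.clatt(h, cluster_ids))
```

Note that the dense [n, n] mask is O(n²); at scale one would compute attention blockwise per cluster instead.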
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to provide the requested statistics, specifications, controls, and comparisons, thereby strengthening the manuscript's claims regarding receptive fields and inductive biases.
point-by-point responses
Point 1.
Referee: [Method / Experiments] The central claim that CLATT delivers large receptive fields while retaining inductive biases depends on cluster sizes being substantially larger than standard 1-hop neighborhoods. The manuscript provides no statistics (mean, median, or distribution) on cluster sizes produced by the chosen community detection algorithms on any evaluated dataset; without this, the receptive-field argument cannot be evaluated and any gains cannot be attributed to the stated mechanism.
Authors: We agree that cluster-size statistics are necessary to rigorously support the receptive-field mechanism. In the revised manuscript we will add a dedicated subsection (with accompanying table) reporting, for every dataset and algorithm, the mean, median, standard deviation, and full distribution of cluster sizes. These numbers will be contrasted with the average 1-hop neighborhood sizes to demonstrate that clusters are substantially larger, directly linking the observed gains to the intended mechanism. Revision: yes.
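As an illustration of the promised comparison, a minimal sketch using networkx's Louvain implementation as a stand-in for the paper's (unspecified) algorithm, with a toy graph as a placeholder dataset:

```python
# Sketch: cluster sizes vs. 1-hop neighborhood sizes under Louvain clustering.
import statistics
import networkx as nx

G = nx.karate_club_graph()  # placeholder; swap in the actual benchmark graph

clusters = nx.community.louvain_communities(G, seed=0)
cluster_sizes = [len(c) for c in clusters]
hop1_sizes = [G.degree(v) + 1 for v in G]  # each node plus its 1-hop neighbors

print(f"clusters: {len(clusters)}")
print(f"cluster size  mean={statistics.mean(cluster_sizes):.1f}  "
      f"median={statistics.median(cluster_sizes)}")
print(f"1-hop size    mean={statistics.mean(hop1_sizes):.1f}  "
      f"median={statistics.median(hop1_sizes)}")
```

The receptive-field argument requires the first distribution to dominate the second on the actual benchmark graphs.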
Point 2.
Referee: [Experiments] The experimental section must specify the exact community detection algorithm(s), all hyperparameters, and the handling of overlapping or multi-cluster assignments, as these are free parameters listed in the paper's own description. Results should include error bars, statistical significance tests, and ablation studies isolating CLATT from the clustering choice itself.
Authors: We will expand the experimental section to name the precise algorithms (e.g., Louvain, Leiden), list all hyperparameters with their chosen values, and explicitly describe the multi-cluster assignment policy used in our implementation. All reported results will include standard-error bars across random seeds, paired statistical significance tests against baselines, and new ablation experiments that replace community detection with random partitioning (while keeping the cluster-size distribution matched) to isolate the contribution of structure-aware clustering from the mere presence of larger attention scopes. Revision: yes.
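One way the size-matched random-partition control could be built (our construction, again assuming a non-overlapping partition):

```python
# Sketch: random partition whose cluster-size distribution matches Louvain's,
# isolating structure-awareness from the mere presence of larger attention scopes.
import random
import networkx as nx

G = nx.karate_club_graph()  # placeholder dataset
communities = nx.community.louvain_communities(G, seed=0)
sizes = [len(c) for c in communities]

nodes = list(G)
random.Random(0).shuffle(nodes)
random_partition, start = [], 0
for size in sizes:  # carve the shuffled node list into chunks of matching sizes
    random_partition.append(set(nodes[start:start + size]))
    start += size

assert sorted(map(len, random_partition)) == sorted(sizes)
```

If CLATT with this partition matches CLATT with Louvain clusters, the gains come from attention scope rather than community structure.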
Point 3.
Referee: [Experiments] Performance tables (presumably in §4 or §5) report improvements but lack explicit comparison to other receptive-field expansion techniques (e.g., higher-order message passing or diffusion-based methods) that also aim to enlarge neighborhoods while respecting topology; this weakens the claim that CLATT is a distinctly advantageous alternative.
Authors: We acknowledge the benefit of direct head-to-head comparisons. The revised version will include additional baselines, specifically k-hop message passing and diffusion-convolution variants, on the same GraphLand and other benchmarks. These results will be presented in the main tables (or in an appendix with a summary in the main text) to position CLATT relative to other topology-respecting receptive-field expansions and to quantify its practical trade-offs in accuracy versus computational cost. Revision: yes.
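For concreteness, one such diffusion-based comparator is a truncated personalized-PageRank diffusion (in the spirit of diffusion-based receptive-field expansion; which baselines the authors will actually run is not stated, so this is only an illustrative sketch):

```python
# Sketch: truncated personalized-PageRank diffusion S ≈ α · Σ_k (1-α)^k · T^k,
# which enlarges each node's receptive field while weighting by topology.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()  # placeholder dataset
A = nx.to_numpy_array(G)
T = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

alpha, K = 0.15, 10                   # teleport probability, truncation order
S = np.zeros_like(A)
P = np.eye(len(G))
for k in range(K + 1):
    S += alpha * (1 - alpha) ** k * P
    P = P @ T                         # advance to the next power of T

# Rows of S weight all nodes reachable within K hops.
print("avg nodes reached per row:", (S > 1e-4).sum(axis=1).mean())
```

Reporting CLATT against such baselines at matched compute would substantiate the claimed advantage.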
Circularity Check
No circularity: new architectural component with independent empirical claims
full rationale
The paper proposes CLATT as an architectural augmentation to MPNNs and Graph Transformers: nodes are partitioned via off-the-shelf community detection, then attention is performed within clusters. This is a design choice, not a mathematical derivation, equation, or prediction that reduces to fitted inputs or self-referential definitions. No equations appear in the abstract or description that equate a claimed result to its own construction. The central claim of performance gains is empirical and falsifiable on external benchmarks, with no load-bearing self-citations or uniqueness theorems invoked. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- choice and parameters of community detection algorithm
axioms (1)
- domain assumption: community detection algorithms produce clusters that meaningfully reflect graph topology and are suitable for attention-based learning.
invented entities (1)
- Cluster Attention (CLATT) mechanism (no independent evidence)
Reference graph
Works this paper leans on
- [1] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- [3] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
- [4] Dwivedi, V., Joshi, C., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
- [5] Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. AAAI 2021 Workshop on Deep Learning on Graphs: Methods and Applications (DLG-AAAI 2021), 2021.
- [6] Liu, Y., Qiu, P., Xing, Y., Liu, Y., Du, P., Hong, C., Zheng, J., Zheng, T., and He, T. Bridging academia and industry: A comprehensive benchmark for attributed graph clustering. arXiv preprint arXiv:2602.08519, 2026.
- [7] Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. GEMSEC: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 65–72. ACM, 2019.
- [8] Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhou, J., Ma, C., Yu, L., Gai, Y., et al. Deep Graph Library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
- [9] Boccaletti et al., 2014 (cited for the two standard definitions of clustering coefficients, the global and the average local; the paper reports both for the graphs used in its experiments).
- [10] Leskovec et al., 2007 (source of the data behind the amazon-ratings-5core dataset of Platonov et al., 2023b, originally called amazon-ratings; only the 5-core subset was used because many of the non-homophily models evaluated on it do not scale).
- [11] Leskovec et al., 2007 (same data; the 5-core reduction iteratively removes nodes of degree less than 5, producing a graph with many small, densely interconnected clusters of five or slightly more nodes).
- [12] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2015 (optimizer used for all runs: 1000 training steps, or 3000 on amazon-ratings-5core; GGT positional encodings, DeepWalk embeddings or Laplacian eigenvectors, have dimension 128).
- [13] Wang et al., 2019 (Deep Graph Library, see [8]; all experiments ran on an NVIDIA Tesla A100 80GB GPU, with code and reproduction instructions at https://github.com/OlegPlatonov/cluster-attention).