pith. sign in

arxiv: 2605.18387 · v1 · pith:GBHKPFUNnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Graph Hierarchical Recurrence for Long-Range Generalization

Pith reviewed 2026-05-20 13:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Graph Neural NetworksLong-range DependenciesHierarchical PoolingOut-of-range GeneralizationParameter EfficiencyRecurrenceGraph Transformers
0
0 comments X

The pith

Graph Hierarchical Recurrence captures long-range dependencies by jointly processing graphs and their pooled hierarchical abstractions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Graph Hierarchical Recurrence (GHR) to tackle the inability of standard Graph Neural Networks and Graph Transformers to capture correlations between distant regions of a graph. These limitations become especially clear in out-of-range generalization, where test graphs require interactions over distances longer than those seen during training. GHR addresses this by applying recurrence across both the original graph and a hierarchical abstraction created through pooling. Across long-range benchmarks, this yields stronger performance, better generalization to unseen distances, and high parameter efficiency, often using only 1% of the parameters required by state-of-the-art models.

Core claim

GHR is a framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. This design delivers strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency, consistently outperforming existing graph models while using as little as 1% of the parameters of current state-of-the-art models.

What carries the argument

Graph Hierarchical Recurrence (GHR), which processes the input graph jointly with a hierarchical abstraction derived from pooling to encode distant correlations.

Load-bearing premise

The hierarchical abstraction obtained through pooling successfully encodes the long-range correlations that standard GNNs and GTs fail to capture.

What would settle it

A concrete counterexample would be a new long-range benchmark where GHR shows no improvement in out-of-range generalization compared to baselines after the pooling step is removed.

Figures

Figures reproduced from arXiv: 2605.18387 by Alessio Gravina, Bruno Lepri, Davide Bacciu, Marco Pacini, Sebastiano Bontorin, Stefano Carotti.

Figure 1
Figure 1. Figure 1: Out-of-Range Generalization for Graph Hierarchical Recurrence. Predicted distances on Single Source Shortest Path (SSSP) task from a sample source node in a Random Geometric Graph (RGG). We compare DeepGIN (10 layers) with our model GHR (TL = 4, TH = 3), both trained on distances of up to 10 hops. While both GIN and GHR accurately predict on in-range distances, only GHR achieves out-of-range generalization… view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of the global recurrent step RW . Visual representation of the nested recurrent scheme as described in Algorithm 2. RL,WL and RH,WH propagate nodes’ information across the low-level graph GL and the high level abstraction GH. Message passing is interleaved between two layers in a nested weight-sharing scheme, which enables deep computation with a minimal parameter budget. 2 Preliminaries Grap… view at source ↗
Figure 3
Figure 3. Figure 3: Diameter of graphs and their pooled abstraction. Comparison of graph diameter between the low-level graph GL and its high-level abstraction GH = Pool(GL) obtained via graclus. (a) Average diameter of GL and GH across the ECHO benchmark [24] and the RGG dataset of Section 4.3, binned by graph size. GH is obtained with one graclus iteration for ECHO and three for RGG. Error bars show the standard deviation. … view at source ↗
Figure 5
Figure 5. Figure 5: Parameter efficiency on ECHO. Mean Absolute Error (MAE) versus number of parameters (log scale) for models evaluated on the SSSP, Eccentricity, and Energy tasks of the ECHO benchmark [24].GHR (lower-left in each panel) achieves state-of-the-art results on SSSP and Eccentricity, and is competitive on Energy, while using orders of magnitude fewer parameters than the baselines. 4.1 ECHO Benchmark The ECHO ben… view at source ↗
Figure 6
Figure 6. Figure 6: Out-of-Range Generalization on RGG SSSP, small-distance regime. MAE stratified by target distance for GHR, a deep GIN baseline, and a GPS-style baseline combining GINE with multi-head attention. Models are trained on RGGs with maximum training distance 5 (left) or 8 (right), and evaluated on test graphs with distances up to 8. The left panel shows that the three models are accurate in-range, but that the e… view at source ↗
Figure 7
Figure 7. Figure 7: Architectural Ablations of GHR on RGG SSSP. MAE stratified by target distance on RGGs for GHR variants and flat baselines (Deep, Recurrent, and their global-recurrence "+GR" counterparts). Models are trained on graphs with shortest-path distances up to 20 hops (dashed line) and evaluated on distances up to 40 hops. Stratifying by target distance reveals where each architecture’s performance degrades. While… view at source ↗
Figure 8
Figure 8. Figure 8: Complete recurrent architecture. Full recurrent architecture with the global recursion scheme and time-discounted loss computation (complete forward pass) as described in Section 3. This appendix details the experimental setup, hardware, and task-specific architectural implementa￾tions used in the main evaluation, followed by extended results. Implementation Details: To highlight the efficiency of the hier… view at source ↗
read the original abstract

Graph Neural Networks (GNNs) and Graph Transformers (GTs) are now a fundamental paradigm for graph learning, combining the representation-learning capabilities of deep models with the sample efficiency induced by their inductive biases. Despite their effectiveness, a large body of work has shown that these models still face fundamental limitations in tasks that require capturing correlations between distant regions of a graph. To address this issue, we introduce Graph Hierarchical Recurrence (GHR), a novel framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. We also show that the limitations of existing models are even more pronounced in out-of-range generalization, where test instances involve interactions over distances longer than those observed during training. By contrast, despite its simple design, GHR provides three key advantages: strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency. To corroborate these claims, we show that across a broad set of long-range benchmarks, GHR consistently outperforms existing graph models while using as little as 1% of the parameters of current state-of-the-art models. These results suggest a complementary direction to the current trend of scaling architectures to obtain graph foundation models, indicating that increased model capacity alone may not be sufficient for generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Graph Hierarchical Recurrence (GHR), a framework that jointly processes an input graph and a hierarchical abstraction obtained via pooling. The central claims are that GHR captures long-range dependencies more effectively than standard GNNs and Graph Transformers, yields improved out-of-range generalization, and achieves these gains with extreme parameter efficiency (as little as 1% of SOTA models) across long-range benchmarks.

Significance. If the empirical results hold under rigorous verification, the work is significant because it offers a lightweight, inductive-bias-driven alternative to the prevailing trend of scaling graph models for foundation-model performance. The emphasis on out-of-range generalization and the reported parameter reduction constitute a concrete, falsifiable contribution that could redirect research away from capacity increases alone.

major comments (2)
  1. [§4.3] §4.3, Table 4 (out-of-range split): the construction of test graphs with distances exceeding the training maximum is described only at a high level; without an explicit distance histogram or generation procedure, it is impossible to confirm that the reported gains are attributable to the hierarchical recurrence rather than differences in graph statistics.
  2. [§3.1] §3.1, Eq. (3): the pooling operator that produces the hierarchical abstraction is defined recursively but the recurrence update across hierarchy levels lacks a closed-form expression for information flow; this step is load-bearing for the claim that GHR encodes correlations beyond the receptive field of standard GNNs.
minor comments (2)
  1. [Figure 2] Figure 2: the legend does not distinguish the GHR curve from the pooled-only ablation, reducing readability of the long-range dependency plots.
  2. [§5] §5: several long-range benchmark citations are missing the year and venue, which should be added for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and positive assessment of our work. The comments identify areas where additional detail will strengthen the presentation, and we address each one below with plans for revision.

read point-by-point responses
  1. Referee: [§4.3] §4.3, Table 4 (out-of-range split): the construction of test graphs with distances exceeding the training maximum is described only at a high level; without an explicit distance histogram or generation procedure, it is impossible to confirm that the reported gains are attributable to the hierarchical recurrence rather than differences in graph statistics.

    Authors: We appreciate the referee highlighting the need for greater transparency in the out-of-range evaluation protocol. In the revised manuscript we will expand §4.3 to include (i) a precise algorithmic description of how test graphs are generated so that their maximum pairwise distance exceeds the training maximum, (ii) pseudocode for the procedure, and (iii) distance histograms (graph diameter and average shortest-path length) comparing the training and test distributions. These additions will make it straightforward to verify that performance differences arise from GHR’s hierarchical recurrence rather than from unintended shifts in graph statistics. revision: yes

  2. Referee: [§3.1] §3.1, Eq. (3): the pooling operator that produces the hierarchical abstraction is defined recursively but the recurrence update across hierarchy levels lacks a closed-form expression for information flow; this step is load-bearing for the claim that GHR encodes correlations beyond the receptive field of standard GNNs.

    Authors: We agree that an explicit illustration of information flow across hierarchy levels would make the receptive-field argument more transparent. While the recursive definition in Eq. (3) is compact and standard for hierarchical models, we will add a step-by-step unrolling of the recurrence for a small two-level hierarchy in §3.1 and the appendix. The example will show how features from distant nodes in the original graph are aggregated through the pooled abstraction, thereby extending the effective receptive field beyond that of a flat GNN. We believe this concrete expansion, rather than a single closed-form expression, sufficiently clarifies the mechanism while preserving the recursive formulation that is natural for the architecture. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces GHR as a new framework combining input graphs with hierarchical pooling abstractions and evaluates it empirically on long-range benchmarks. Claims of outperformance and parameter efficiency are presented as experimental outcomes rather than mathematical derivations. No equations, self-referential definitions, or load-bearing self-citations are visible in the provided text that would reduce predictions to fitted inputs or prior author work by construction. The central argument rests on benchmark results testing the inductive bias, which is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is limited to the abstract; full details on modeling assumptions, pooling operators, and recurrence definitions are unavailable.

axioms (1)
  • domain assumption Existing GNNs and GTs face fundamental limitations in capturing correlations between distant regions of a graph.
    Stated as background motivation in the abstract.
invented entities (1)
  • Graph Hierarchical Recurrence (GHR) no independent evidence
    purpose: Framework that operates jointly on the input graph and a hierarchical abstraction obtained through pooling.
    Newly introduced method whose independent validation is not provided in the abstract.

pith-pipeline@v0.9.0 · 5765 in / 1176 out tokens · 68228 ms · 2026-05-20T13:00:53.038299+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

  1. [1]

    On the bottleneck of graph neural networks and its practical implications

    Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. InInternational Conference on Learning Representations, 2021

  2. [2]

    DiffWire: Inductive Graph Rewiring via the Lovász Bound

    Adrián Arnaiz-Rodríguez, Ahmed Begga, Francisco Escolano, and Nuria M Oliver. DiffWire: Inductive Graph Rewiring via the Lovász Bound. In Bastian Rieck and Razvan Pascanu, editors,Proceedings of the First Learning on Graphs Conference, volume 198 ofProceedings of Machine Learning Research, pages 15:1–15:27. PMLR, 09–12 Dec 2022

  3. [3]

    American Mathematical Society Providence, RI, 2013

    David A Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner.Graph partitioning and graph clustering, volume 588. American Mathematical Society Providence, RI, 2013

  4. [4]

    Spatial networks.Physics reports, 499(1-3):1–101, 2011

    Marc Barthélemy. Spatial networks.Physics reports, 499(1-3):1–101, 2011

  5. [5]

    Residual gated graph convnets, 2018

    Xavier Bresson and Thomas Laurent. Residual gated graph convnets, 2018

  6. [6]

    A Note on Over-Smoothing for Graph Neural Networks, June 2020

    Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks.arXiv preprint arXiv:2006.13318, 2020

  7. [7]

    Weighted graph cuts without eigenvec- tors a multilevel approach.IEEE transactions on pattern analysis and machine intelligence, 29(11):1944–1957, 2007

    Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvec- tors a multilevel approach.IEEE transactions on pattern analysis and machine intelligence, 29(11):1944–1957, 2007

  8. [8]

    On over-squashing in message passing neural networks: the impact of width, depth, and topology

    Francesco Di Giovanni, Lorenzo Giusti, Federico Barbero, Giulia Luise, Pietro Liò, and Michael Bronstein. On over-squashing in message passing neural networks: the impact of width, depth, and topology. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  9. [9]

    A generalization of transformer networks to graphs

    Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021

  10. [10]

    Long Range Graph Benchmark

    Vijay Prakash Dwivedi, Ladislav Rampášek, Michael Galkin, Ali Parviz, Guy Wolf, Anh Tuan Luu, and Dominique Beaini. Long Range Graph Benchmark. InAdvances in Neural Information Processing Systems, volume 35, pages 22326–22340. Curran Associates, Inc., 2022

  11. [11]

    Adaptive message passing: A general framework to mitigate oversmoothing, oversquashing, and underreaching

    Federico Errica, Henrik Christiansen, Viktor Zaverkin, Takashi Maruyama, Mathias Niepert, and Francesco Alesiani. Adaptive message passing: A general framework to mitigate oversmoothing, oversquashing, and underreaching. InForty-second International Conference on Machine Learning, 2025

  12. [12]

    Graph u-nets

    Hongyang Gao and Shuiwang Ji. Graph u-nets. Ininternational conference on machine learning, pages 2083–2092. PMLR, 2019

  13. [13]

    Schoenholz, Patrick F

    Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1263–1272. JMLR.org, 2017

  14. [14]

    Understanding pooling in graph neural networks.IEEE transactions on neural networks and learning systems, 35(2):2708–2718, 2022

    Daniele Grattarola, Daniele Zambon, Filippo Maria Bianchi, and Cesare Alippi. Understanding pooling in graph neural networks.IEEE transactions on neural networks and learning systems, 35(2):2708–2718, 2022

  15. [15]

    Anti-symmetric DGN: a stable architecture for deep graph networks

    Alessio Gravina, Davide Bacciu, and Claudio Gallicchio. Anti-symmetric DGN: a stable architecture for deep graph networks. InThe Eleventh International Conference on Learning Representations, 2023. 11

  16. [16]

    On Oversquashing in Graph Neural Networks Through the Lens of Dynamical Systems.Proceedings of the AAAI Conference on Artificial Intelligence, 39(16):16906–16914, Apr

    Alessio Gravina, Moshe Eliasof, Claudio Gallicchio, Davide Bacciu, and Carola-Bibiane Schönlieb. On Oversquashing in Graph Neural Networks Through the Lens of Dynamical Systems.Proceedings of the AAAI Conference on Artificial Intelligence, 39(16):16906–16914, Apr. 2025

  17. [17]

    DRew: Dynamically rewired message passing with delay

    Benjamin Gutteridge, Xiaowen Dong, Michael M Bronstein, and Francesco Di Giovanni. DRew: Dynamically rewired message passing with delay. InInternational Conference on Machine Learning, pages 12252–12267. PMLR, 2023

  18. [18]

    arXiv preprint arXiv:2103.09430 (2021)

    Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb- lsc: A large-scale challenge for machine learning on graphs.arXiv preprint arXiv:2103.09430, 2021

  19. [19]

    Strategies for pre-training graph neural networks

    Weihua Hu*, Bowen Liu*, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. InInternational Conference on Learning Representations, 2020

  20. [20]

    Semi-Supervised Classification with Graph Convolutional Networks

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.CoRR, abs/1609.02907, 2016

  21. [21]

    What graph neural networks cannot learn: depth vs width

    Andreas Loukas. What graph neural networks cannot learn: depth vs width. InInternational Conference on Learning Representations, 2020

  22. [22]

    LRIM: a physics-based benchmark for provably evaluating long-range capabilities in graph learning

    Joël Mathys, Henrik Christiansen, Federico Errica, Takashi Maruyama, and Francesco Alesiani. LRIM: a physics-based benchmark for provably evaluating long-range capabilities in graph learning. InThe Fourteenth International Conference on Learning Representations, 2026

  23. [23]

    Neural Network for Graphs: A Contextual Constructive Approach.IEEE Transactions on Neural Networks, 20(3):498–511, 2009

    Alessio Micheli. Neural Network for Graphs: A Contextual Constructive Approach.IEEE Transactions on Neural Networks, 20(3):498–511, 2009

  24. [24]

    Can you hear me now? a benchmark for long-range graph propagation

    Luca Miglior, Matteo Tolloso, Alessio Gravina, and Davide Bacciu. Can you hear me now? a benchmark for long-range graph propagation. InThe Fourteenth International Conference on Learning Representations, 2026

  25. [25]

    Short-range oversquashing

    Yaaqov Mishayev, Yonatan Sverdlov, Tal Amir, and Nadav Dym. Short-range oversquashing. InThe Fourth Learning on Graphs Conference, 2025

  26. [26]

    Graph neural networks exponentially lose expressive power for node classification

    Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. InInternational Conference on Learning Representations, 2020

  27. [27]

    On the difficulty of training recurrent neural networks

    Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. InProceedings of the 30th International Conference on International Con- ference on Machine Learning - Volume 28, ICML’13, page III–1310–III–1318. JMLR.org, 2013

  28. [28]

    Recipe for a General, Powerful, Scalable Graph Transformer.Advances in Neural Information Processing Systems, 35, 2022

    Ladislav Rampášek, Mikhail Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a General, Powerful, Scalable Graph Transformer.Advances in Neural Information Processing Systems, 35, 2022

  29. [29]

    The graph neural network model.IEEE Transactions on Neural Networks, 20(1):61–80, 2009

    Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model.IEEE Transactions on Neural Networks, 20(1):61–80, 2009

  30. [30]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  31. [31]

    Bronstein, and Johannes F

    Joshua Southern, Francesco Di Giovanni, Michael M. Bronstein, and Johannes F. Lutzeyer. Understanding virtual nodes: Oversquashing and node heterogeneity. InThe Thirteenth Interna- tional Conference on Learning Representations, 2025

  32. [32]

    Towards scale-invariant graph-related problem solving by iterative homogeneous gnns

    Hao Tang, Zhiao Huang, Jiayuan Gu, Bao-Liang Lu, and Hao Su. Towards scale-invariant graph-related problem solving by iterative homogeneous gnns. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 15811–15822. Curran Associates, Inc., 2020. 12

  33. [33]

    Bronstein

    Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. InInternational Conference on Learning Representations, 2022

  34. [34]

    Graph attention networks

    Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. InInternational Conference on Learning Representations, 2018

  35. [35]

    Neural execution of graph algorithms

    Petar Veliˇckovi´c, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural execution of graph algorithms. InInternational Conference on Learning Representations, 2020

  36. [36]

    Hierarchical Reasoning Model

    Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model.arXiv preprint arXiv:2506.21734, 2025

  37. [37]

    Hierarchical graph representation learning with differentiable pooling.Advances in neural information processing systems, 31, 2018

    Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling.Advances in neural information processing systems, 31, 2018

  38. [38]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  39. [39]

    On vanishing gradients, over- smoothing, and over-squashing in gnns: Bridging recurrent and graph learning, 2025

    Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, and Pierre Vandergheynst. On vanishing gradients, over- smoothing, and over-squashing in gnns: Bridging recurrent and graph learning, 2025. A Additional details on the architecture A.1 Modular Message Passing and the Gated GINE Insta...