pith. sign in

arxiv: 2511.06443 · v1 · submitted 2025-11-09 · 💻 cs.LG

How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation

Pith reviewed 2026-05-17 23:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph neural networksover-squashingchannel capacityinformation theoryhidden dimensionspropagation depthspectral GNNrepresentation learning
0
0 comments X

The pith

Channel capacity estimation selects hidden dimensions and depths to mitigate over-squashing in GNNs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework called C3E that selects hidden dimensions and propagation depths for spectral graph neural networks by modeling them as communication channels and solving a nonlinear program to optimize channel capacity. A sympathetic reader would care because typical heuristic choices for these settings often produce severe loss of information when signals propagate across graphs. The work shows that over-squashing arises specifically from cumulative compression inside the evolving representation matrices. Experiments across nine datasets indicate that wider hidden layers reliably lessen this compression while depth requires balancing compression against the growth in representational power.

Core claim

Spectral graph neural networks can be treated as communication channels whose capacity depends on hidden dimensions, propagation depth, the chosen propagation mechanism, and the structure of the input graph. Formulating the choice of dimensions and depth as a nonlinear programming problem that maximizes this capacity yields architectural settings that reduce the cumulative compression of information in representation matrices. Larger hidden dimensions directly counteract compression, whereas depth exhibits a more nuanced effect that reflects a trade-off between added compression and increased representation complexity.

What carries the argument

Channel Capacity Constrained Estimation (C3E), a nonlinear programming procedure that treats GNN propagation steps as a communication channel to compute suitable hidden dimensions and depth.

If this is right

  • Over-squashing occurs through the progressive compression of information inside successive representation matrices.
  • Larger hidden dimensions reduce the amount of information lost to compression at each step.
  • Propagation depth must be chosen to balance further compression against gains in representational complexity.
  • Settings produced by the capacity-constrained optimization improve representation learning on the evaluated datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same capacity-modeling idea could be applied to message-passing rules in non-spectral GNNs if an analogous channel description can be written down.
  • For graphs with long diameters the depth trade-off might favor shallower networks paired with wider layers.
  • Running the nonlinear program on graphs with controlled changes in diameter or density would test how sensitive the resulting choices are to structure.

Load-bearing premise

Modeling spectral graph neural networks as communication channels whose capacity is determined by hidden dimensions, depth, and graph structure accurately reflects actual information flow during propagation.

What would settle it

Finding that architectures chosen by the nonlinear program produce no measurable drop in over-squashing or no gain in representation quality on a fresh collection of graphs would show the capacity model does not predict real behavior.

Figures

Figures reproduced from arXiv: 2511.06443 by Jin Zheng, John Cartlidge, Zinuo You.

Figure 1
Figure 1. Figure 1: The green-axis (left) denotes performance, and blue-axis (right) denotes parameters counts (in millions). C3E￾estimated solutions (red points, starred for optimal) consistently land in high-performance regions within dashed intervals defined in Eq. (12), outperforming baselines with heuristic dimensions (green points, e.g., 16, 32, 64, 128, 256, 512, 1024). Statistics Cora Citeseer Pubmed AmazonPhoto Amazo… view at source ↗
Figure 2
Figure 2. Figure 2: C3E-estimated models (red lines) avoid information loss by maintaining high H(Hl) across layers. In contrast, naively stacked baselines (green lines) suffer from over-squashing, with entropy collapsing to near-zero long before the final layer. Dotted lines indicate the maximum entropy for initial feature dimensions (brown) and fixed hidden dimensions (blue). 5.3 Entropy Transitions in Representation Matrix… view at source ↗
Figure 3
Figure 3. Figure 3: Average performance of C3E-estimated models versus η across nine benchmark datasets. Each curve shows the averaged C3E-estimated model performance per dataset. Page 19 of 29 [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Examples of C3E solutions (hidden dimensions) on five benchmark datasets. Page 21 of 29 [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The relationship between propagation depth ( [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Learnable weight matrix entry distribution on Cora, [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representation matrix entry distribution on Cora, [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Learnable weight matrix entry distribution on AmazonPhoto, [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Representation matrix entry distribution on AmazonPhoto, [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Learnable weight matrix entry distribution on AmazonComputers, [PITH_FULL_IMAGE:figures/full_fig_p028_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Representation matrix entry distribution on AmazonComputers, [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗
read the original abstract

Existing graph neural networks typically rely on heuristic choices for hidden dimensions and propagation depths, which often lead to severe information loss during propagation, known as over-squashing. To address this issue, we propose Channel Capacity Constrained Estimation (C3E), a novel framework that formulates the selection of hidden dimensions and depth as a nonlinear programming problem grounded in information theory. Through modeling spectral graph neural networks as communication channels, our approach directly connects channel capacity to hidden dimensions, propagation depth, propagation mechanism, and graph structure. Extensive experiments on nine public datasets demonstrate that hidden dimensions and depths estimated by C3E can mitigate over-squashing and consistently improve representation learning. Experimental results show that over-squashing occurs due to the cumulative compression of information in representation matrices. Furthermore, our findings show that increasing hidden dimensions indeed mitigate information compression, while the role of propagation depth is more nuanced, uncovering a fundamental balance between information compression and representation complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Channel Capacity Constrained Estimation (C3E), a nonlinear programming framework that selects hidden dimensions and propagation depths for spectral GNNs by modeling them as communication channels whose capacity depends on graph structure, depth, dimensions, and propagation mechanism. It attributes over-squashing to cumulative compression of information in representation matrices and reports that C3E-derived choices mitigate this effect while improving representation learning on nine public datasets; depth is characterized as playing a nuanced role that balances compression against representation complexity.

Significance. If the channel-capacity formulation accurately predicts information flow and the resulting parameters generalize, the work would supply a principled, information-theoretic alternative to heuristic choices of width and depth in GNNs, together with empirical evidence linking architectural decisions to measurable compression in representation matrices. The multi-dataset evaluation is a positive feature, though its interpretive weight hinges on the validity of the underlying capacity model.

major comments (3)
  1. [Abstract and §3] Abstract and §3: the nonlinear program is presented as directly connecting channel capacity to hidden dimension, depth, and graph structure, yet the manuscript supplies no explicit capacity formula, no description of how the program is solved, and no statement of the precise constraints or scaling constants employed; without these details the central claim that the selected parameters reduce observed compression cannot be verified.
  2. [§4] §4 (experimental results): consistent gains are reported across nine datasets, but the text does not indicate whether error bars, multiple random seeds, or statistical significance tests accompany the improvements; this omission weakens the assertion that C3E “consistently” mitigates over-squashing.
  3. [§3.2] §3.2 (capacity model): the formulation appears to treat each layer as an additive or linear channel whose mutual information scales directly with dimension and depth, yet the paper itself describes over-squashing as arising from nonlinear activations and repeated Laplacian applications; the mapping from empirical compression in representation matrices to the capacity objective is therefore not secured.
minor comments (2)
  1. [§3] Notation for the channel-capacity threshold and any scaling constants should be introduced with explicit symbols and distinguished from free parameters.
  2. [§3] A brief discussion of how graph-specific quantities (e.g., eigenvalues or effective diameter) are estimated without introducing circularity relative to the downstream task would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to strengthen the presentation of the C3E framework and its empirical validation.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3: the nonlinear program is presented as directly connecting channel capacity to hidden dimension, depth, and graph structure, yet the manuscript supplies no explicit capacity formula, no description of how the program is solved, and no statement of the precise constraints or scaling constants employed; without these details the central claim that the selected parameters reduce observed compression cannot be verified.

    Authors: We appreciate this observation and agree that additional mathematical detail is necessary for verifiability. The original submission presented the nonlinear program at a high level without the full capacity expression or solver specifics. In the revised manuscript we have inserted the explicit channel-capacity formula (derived from the mutual information between input features and propagated representations under the spectral filter), the complete nonlinear program with all equality and inequality constraints, the chosen scaling constants, and a description of the interior-point solver together with convergence criteria. These additions directly support the claim that the optimized parameters reduce measured compression in the representation matrices. revision: yes

  2. Referee: [§4] §4 (experimental results): consistent gains are reported across nine datasets, but the text does not indicate whether error bars, multiple random seeds, or statistical significance tests accompany the improvements; this omission weakens the assertion that C3E “consistently” mitigates over-squashing.

    Authors: We concur that statistical reporting is essential to substantiate consistency claims. We have re-executed all experiments using ten independent random seeds, added standard-deviation error bars to every table and figure in §4, and included paired t-test p-values comparing C3E-derived configurations against baselines. The revised results confirm that the observed improvements remain statistically significant (p < 0.05) across the nine datasets, thereby reinforcing the assertion that C3E mitigates over-squashing. revision: yes

  3. Referee: [§3.2] §3.2 (capacity model): the formulation appears to treat each layer as an additive or linear channel whose mutual information scales directly with dimension and depth, yet the paper itself describes over-squashing as arising from nonlinear activations and repeated Laplacian applications; the mapping from empirical compression in representation matrices to the capacity objective is therefore not secured.

    Authors: We thank the referee for this precise critique of the modeling assumptions. Although over-squashing originates from nonlinear activations and iterated Laplacian multiplications, the C3E capacity model employs an effective information-theoretic upper bound that is calibrated against the empirically observed compression in the final representation matrices. In the revision we have augmented §3.2 with a short derivation showing how the spectral capacity expression bounds the end-to-end mutual information even in the presence of nonlinearities, together with additional plots that correlate the capacity objective directly with measured compression metrics. We believe these clarifications secure the mapping while preserving the abstraction’s utility. revision: partial

Circularity Check

0 steps flagged

No significant circularity: derivation grounded in external information-theoretic model

full rationale

The paper formulates hidden dimension and depth selection as a nonlinear program that maximizes channel capacity under a communication-channel model of spectral GNN propagation. This objective is defined from graph structure, propagation mechanism, and information-theoretic capacity expressions rather than from the downstream task loss or representation quality metrics used in evaluation. No equations reduce the chosen parameters to fitted values of the target performance; the optimization is solved once per graph and then applied to GNN training. No self-citation chains, self-definitional loops, or renaming of known empirical patterns appear in the derivation. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on treating spectral GNNs as communication channels whose capacity can be computed from dimensions, depth, and graph structure; this modeling step is an unproven domain assumption rather than a derived result.

free parameters (1)
  • channel capacity threshold or scaling constants
    The nonlinear program necessarily introduces at least one tunable threshold or scaling factor to balance compression against representation complexity; its value is not derived from first principles.
axioms (1)
  • domain assumption Spectral graph neural networks can be modeled as communication channels whose capacity is a direct function of hidden dimension, propagation depth, propagation mechanism, and graph structure.
    This modeling choice is invoked to turn architecture selection into a capacity-constrained optimization problem.

pith-pipeline@v0.9.0 · 5469 in / 1319 out tokens · 29958 ms · 2026-05-17T23:19:50.454465+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 6 internal anchors

  1. [1]

    Achille, A.; and Soatto, S. 2018. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50): 1--34

  2. [2]

    Alon, U.; and Yahav, E. 2020. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205

  3. [3]

    Bodnar, C.; Di Giovanni, F.; Chamberlain, B.; Lio, P.; and Bronstein, M. 2022. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in GNN s. Adv. Neural Information Processing Systems, 35: 18527--18541

  4. [4]

    Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and deep locally connected networks on graphs. In Int. Conf. on Learning Representations (ICLR)

  5. [5]

    Cai, C.; and Wang, Y. 2020. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318

  6. [6]

    Cai, S.; Li, L.; Deng, J.; Zhang, B.; Zha, Z.-J.; Su, L.; and Huang, Q. 2021. Rethinking graph neural architecture search from message-passing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6657--6666

  7. [7]

    Chan, K. H. R.; Yu, Y.; You, C.; Qi, H.; Wright, J.; and Ma, Y. 2022. Redunet: A white-box deep network from the principle of maximizing rate reduction. Journal of Machine Learning Research, 23(114): 1--103

  8. [8]

    Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; and Sun, X. 2020. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In AAAI Conference on Artificial Intelligence, volume 34, 3438--3445

  9. [9]

    Chien, E.; Peng, J.; Li, P.; and Milenkovic, O. 2020. Adaptive universal generalized pagerank graph neural network. arXiv preprint arXiv:2006.07988

  10. [10]

    Cong, W.; Ramezani, M.; and Mahdavi, M. 2021. On provable benefits of depth in training graph convolutional networks. Adv. Neural Information Processing Systems, 34: 9936--9949

  11. [11]

    Cover, T. M. 1999. Elements of information theory. John Wiley & Sons

  12. [12]

    Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Adv. Neural Information Processing Systems, 29

  13. [13]

    Dehmer, M.; and Mowshowitz, A. 2011. A history of graph entropy measures. Information Sciences, 181(1): 57--78

  14. [14]

    Di Giovanni, F.; Giusti, L.; Barbero, F.; Luise, G.; Lio, P.; and Bronstein, M. M. 2023. On over-squashing in message passing neural networks: The impact of width, depth, and topology. In Int. Conf. on Machine Learning (ICML), 7865--7885

  15. [15]

    P.; Joshi, C

    Dwivedi, V. P.; Joshi, C. K.; Luu, A. T.; Laurent, T.; Bengio, Y.; and Bresson, X. 2023. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43): 1--48

  16. [16]

    Fuzhen, Z. 2005. The S chur complement and its applications. Series: Numerical Methods and Algorithms, 4

  17. [17]

    Gallager, R. G. 1968. Information theory and reliable communication, volume 588. Springer

  18. [18]

    Gasteiger, J.; Bojchevski, A.; and G \"u nnemann, S. 2018. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In Int. Conf. on Learning Representations (ICLR)

  19. [19]

    Gasteiger, J.; Wei enberger, S.; and G \"u nnemann, S. 2019. Diffusion improves graph learning. Adv. Neural Information Processing Systems, 32

  20. [20]

    Glorot, X.; and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249--256. JMLR Workshop and Conference Proceedings

  21. [21]

    Gravina, A.; Eliasof, M.; Gallicchio, C.; Bacciu, D.; and Sch \"o nlieb, C.-B. 2025. On oversquashing in graph neural networks through the lens of dynamical systems. In AAAI Conference on Artificial Intelligence, volume 39(16), 16906--16914

  22. [22]

    Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Adv. Neural Information Processing Systems, 30

  23. [23]

    He, M.; Wei, Z.; and Wen, J.-R. 2022. Convolutional neural networks on graphs with chebyshev approximation, revisited. Adv. Neural Information Processing Systems, 35: 7264--7276

  24. [24]

    He, M.; Wei, Z.; Xu, H.; et al. 2021. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Adv. Neural Information Processing Systems, 34: 14239--14251

  25. [25]

    Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163

  26. [26]

    Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020. Open graph benchmark: Datasets for machine learning on graphs. Adv. Neural Information Processing Systems, 33: 22118--22133

  27. [27]

    G.; Li, M.; et al

    Huang, K.; Wang, Y. G.; Li, M.; et al. 2024. How Universal Polynomial Bases Enhance Spectral Graph Neural Networks: Heterophily, Over-smoothing, and Over-squashing. arXiv preprint arXiv:2405.12474

  28. [28]

    Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Information Processing Systems, 31

  29. [29]

    Jaynes, E. 2003. Probability Theory: The Logic of Science, volume 727. Cambridge University Press

  30. [30]

    Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical review, 106(4): 620

  31. [31]

    K.; and Mont \'u far, G

    Karhadkar, K.; Banerjee, P. K.; and Mont \'u far, G. 2022. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. arXiv preprint arXiv:2210.11790

  32. [32]

    N.; and Welling, M

    Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Int. Conf. on Learning Representations (ICLR) (ICLR)

  33. [33]

    Kraft, D. 1988. A software package for sequential quadratic programming. Forschungsbericht- Deutsche Forschungs- und Versuchsanstalt fur Luft- und Raumfahrt

  34. [34]

    Kullback, S. 1997. Information theory and statistics. Courier Corporation

  35. [35]

    Lee, J.; Lee, I.; and Kang, J. 2019. Self-attention graph pooling. In Int. Conf. on Machine Learning (ICML), 3734--3743

  36. [36]

    Loukas, A. 2020 a . How hard is to distinguish graphs with graph neural networks? Adv. Neural Information Processing Systems, 33: 3465--3476

  37. [37]

    Loukas, A. 2020 b . What graph neural networks cannot learn: depth vs width. In Int. Conf. on Learning Representations (ICLR)

  38. [38]

    Maskey, S.; Paolino, R.; Bacho, A.; and Kutyniok, G. 2024. A fractional graph L aplacian approach to oversmoothing. Adv. Neural Information Processing Systems, 36

  39. [39]

    Nt, H.; and Maehara, T. 2019. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550

  40. [40]

    H.; Chen, D.; and Borgwardt, K

    Pellizzoni, P.; Schulz, T. H.; Chen, D.; and Borgwardt, K. 2024. On the expressivity and sample complexity of node-individualized graph neural networks. Adv. Neural Information Processing Systems, 37: 120221--120251

  41. [41]

    P.; Luu, A

    Ramp \'a s ek, L.; Galkin, M.; Dwivedi, V. P.; Luu, A. T.; Wolf, G.; and Beaini, D. 2022. Recipe for a general, powerful, scalable graph transformer. Adv. Neural Information Processing Systems, 35: 14501--14515

  42. [42]

    A.; Yaida, S.; and Hanin, B

    Roberts, D. A.; Yaida, S.; and Hanin, B. 2022. The principles of deep learning theory, volume 46. Cambridge University Press Cambridge, MA, USA

  43. [43]

    Rozemberczki, B.; Allen, C.; and Sarkar, R. 2021. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2): cnab014

  44. [44]

    M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B

    Saxe, A. M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B. D.; and Cox, D. D. 2019. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): 124020

  45. [45]

    Deep Information Propagation

    Schoenholz, S. S.; Gilmer, J.; Ganguli, S.; and Sohl-Dickstein, J. 2016. Deep information propagation. arXiv preprint arXiv:1611.01232

  46. [46]

    Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine, 29(3): 93--93

  47. [47]

    Shannon, C. E. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3): 379--423

  48. [48]

    Shchur, O.; Mumme, M.; Bojchevski, A.; and G \"u nnemann, S. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868

  49. [49]

    Shen, X.; Wang, Y.; Lin, M.; Huang, Y.; Tang, H.; Sun, X.; and Wang, Y. 2023. Deepmad: Mathematical architecture design for deep convolutional neural network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6163--6173

  50. [50]

    Sun, Z.; Lin, M.; Sun, X.; Tan, Z.; Li, H.; and Jin, R. 2021. Mae-det: Revisiting maximum entropy principle in zero-shot nas for efficient object detection. arXiv preprint arXiv:2111.13336

  51. [51]

    E.; and Bishop, C

    Tipping, M. E.; and Bishop, C. M. 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3): 611--622

  52. [52]

    The information bottleneck method

    Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057

  53. [53]

    Understanding over-squashing and bottlenecks on graphs via curvature.arXiv preprint arXiv:2111.14522,

    Topping, J.; Di Giovanni, F.; Chamberlain, B. P.; Dong, X.; and Bronstein, M. M. 2021. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522

  54. [54]

    Veli c kovi \'c , P.; Cucurull, G.; Casanova, A.; Romero, A.; Li \`o , P.; and Bengio, Y. 2018. Graph Attention Networks. In Int. Conf. on Learning Representations (ICLR)

  55. [55]

    Von Neumann, J. 2018. Mathematical foundations of quantum mechanics: New edition, volume 53. Princeton University Press

  56. [56]

    Wang, X.; and Zhang, M. 2022. How powerful are spectral graph neural networks. In Int. Conf. on Machine Learning (ICML), 23341--23362

  57. [57]

    Wang, Z.; and Wu, L. 2023. Theoretical analysis of the inductive biases in deep convolutional networks. Adv. Neural Information Processing Systems, 36: 74289--74338

  58. [58]

    Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In Int. Conf. on Machine Learning (ICML), 6861--6871

  59. [59]

    Wu, T.; Ren, H.; Li, P.; and Leskovec, J. 2020. Graph information bottleneck. Adv. Neural Information Processing Systems, 33: 20437--20448

  60. [60]

    Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018 a . How powerful are graph neural networks? arXiv preprint arXiv:1810.00826

  61. [61]

    Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018 b . Representation learning on graphs with jumping knowledge networks. In Int. Conf. on Machine Learning (ICML), 5453--5462

  62. [62]

    Yang, C.; Wang, R.; Yao, S.; Liu, S.; and Abdelzaher, T. 2020. Revisiting over-smoothing in deep GCNs. arXiv preprint arXiv:2003.13663

  63. [63]

    Z.; Peng, H.; Li, A.; Xue, S.; and Su, J

    Yang, Z.; Zhang, G.; Wu, J.; Yang, J.; Sheng, Q. Z.; Peng, H.; Li, A.; Xue, S.; and Su, J. 2023. Minimum entropy principle guided graph neural networks. In ACM International Conference on Web Search and Data Mining, 114--122

  64. [64]

    You, J.; Ying, Z.; and Leskovec, J. 2020. Design space for graph neural networks. Adv. Neural Information Processing Systems, 33: 17009--17021

  65. [65]

    S.; Hooi, B.; Xu, H.; and Feng, J

    Zhou, K.; Dong, Y.; Wang, K.; Lee, W. S.; Hooi, B.; Xu, H.; and Feng, J. 2021. Understanding and resolving performance degradation in deep graph convolutional networks. In 30th ACM International Conference on Information & Knowledge Management, 2728--2737

  66. [66]

    Zhu, H.; and Koniusz, P. 2021. Simple spectral graph convolution. In Int. Conf. on Learning Representations (ICLR)