How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation
Pith reviewed 2026-05-17 23:19 UTC · model grok-4.3
The pith
Channel capacity estimation selects hidden dimensions and depths to mitigate over-squashing in GNNs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Spectral graph neural networks can be treated as communication channels whose capacity depends on hidden dimensions, propagation depth, the chosen propagation mechanism, and the structure of the input graph. Formulating the choice of dimensions and depth as a nonlinear programming problem that maximizes this capacity yields architectural settings that reduce the cumulative compression of information in representation matrices. Larger hidden dimensions directly counteract compression, whereas depth exhibits a more nuanced effect that reflects a trade-off between added compression and increased representation complexity.
What carries the argument
Channel Capacity Constrained Estimation (C3E), a nonlinear programming procedure that treats GNN propagation steps as a communication channel to compute suitable hidden dimensions and depth.
If this is right
- Over-squashing occurs through the progressive compression of information inside successive representation matrices.
- Larger hidden dimensions reduce the amount of information lost to compression at each step.
- Propagation depth must be chosen to balance further compression against gains in representational complexity.
- Settings produced by the capacity-constrained optimization improve representation learning on the evaluated datasets.
Where Pith is reading between the lines
- The same capacity-modeling idea could be applied to message-passing rules in non-spectral GNNs if an analogous channel description can be written down.
- For graphs with long diameters the depth trade-off might favor shallower networks paired with wider layers.
- Running the nonlinear program on graphs with controlled changes in diameter or density would test how sensitive the resulting choices are to structure.
Load-bearing premise
Modeling spectral graph neural networks as communication channels whose capacity is determined by hidden dimensions, depth, and graph structure accurately reflects actual information flow during propagation.
What would settle it
Finding that architectures chosen by the nonlinear program produce no measurable drop in over-squashing or no gain in representation quality on a fresh collection of graphs would show the capacity model does not predict real behavior.
Figures
read the original abstract
Existing graph neural networks typically rely on heuristic choices for hidden dimensions and propagation depths, which often lead to severe information loss during propagation, known as over-squashing. To address this issue, we propose Channel Capacity Constrained Estimation (C3E), a novel framework that formulates the selection of hidden dimensions and depth as a nonlinear programming problem grounded in information theory. Through modeling spectral graph neural networks as communication channels, our approach directly connects channel capacity to hidden dimensions, propagation depth, propagation mechanism, and graph structure. Extensive experiments on nine public datasets demonstrate that hidden dimensions and depths estimated by C3E can mitigate over-squashing and consistently improve representation learning. Experimental results show that over-squashing occurs due to the cumulative compression of information in representation matrices. Furthermore, our findings show that increasing hidden dimensions indeed mitigate information compression, while the role of propagation depth is more nuanced, uncovering a fundamental balance between information compression and representation complexity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Channel Capacity Constrained Estimation (C3E), a nonlinear programming framework that selects hidden dimensions and propagation depths for spectral GNNs by modeling them as communication channels whose capacity depends on graph structure, depth, dimensions, and propagation mechanism. It attributes over-squashing to cumulative compression of information in representation matrices and reports that C3E-derived choices mitigate this effect while improving representation learning on nine public datasets; depth is characterized as playing a nuanced role that balances compression against representation complexity.
Significance. If the channel-capacity formulation accurately predicts information flow and the resulting parameters generalize, the work would supply a principled, information-theoretic alternative to heuristic choices of width and depth in GNNs, together with empirical evidence linking architectural decisions to measurable compression in representation matrices. The multi-dataset evaluation is a positive feature, though its interpretive weight hinges on the validity of the underlying capacity model.
major comments (3)
- [Abstract and §3] Abstract and §3: the nonlinear program is presented as directly connecting channel capacity to hidden dimension, depth, and graph structure, yet the manuscript supplies no explicit capacity formula, no description of how the program is solved, and no statement of the precise constraints or scaling constants employed; without these details the central claim that the selected parameters reduce observed compression cannot be verified.
- [§4] §4 (experimental results): consistent gains are reported across nine datasets, but the text does not indicate whether error bars, multiple random seeds, or statistical significance tests accompany the improvements; this omission weakens the assertion that C3E “consistently” mitigates over-squashing.
- [§3.2] §3.2 (capacity model): the formulation appears to treat each layer as an additive or linear channel whose mutual information scales directly with dimension and depth, yet the paper itself describes over-squashing as arising from nonlinear activations and repeated Laplacian applications; the mapping from empirical compression in representation matrices to the capacity objective is therefore not secured.
minor comments (2)
- [§3] Notation for the channel-capacity threshold and any scaling constants should be introduced with explicit symbols and distinguished from free parameters.
- [§3] A brief discussion of how graph-specific quantities (e.g., eigenvalues or effective diameter) are estimated without introducing circularity relative to the downstream task would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to strengthen the presentation of the C3E framework and its empirical validation.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3: the nonlinear program is presented as directly connecting channel capacity to hidden dimension, depth, and graph structure, yet the manuscript supplies no explicit capacity formula, no description of how the program is solved, and no statement of the precise constraints or scaling constants employed; without these details the central claim that the selected parameters reduce observed compression cannot be verified.
Authors: We appreciate this observation and agree that additional mathematical detail is necessary for verifiability. The original submission presented the nonlinear program at a high level without the full capacity expression or solver specifics. In the revised manuscript we have inserted the explicit channel-capacity formula (derived from the mutual information between input features and propagated representations under the spectral filter), the complete nonlinear program with all equality and inequality constraints, the chosen scaling constants, and a description of the interior-point solver together with convergence criteria. These additions directly support the claim that the optimized parameters reduce measured compression in the representation matrices. revision: yes
-
Referee: [§4] §4 (experimental results): consistent gains are reported across nine datasets, but the text does not indicate whether error bars, multiple random seeds, or statistical significance tests accompany the improvements; this omission weakens the assertion that C3E “consistently” mitigates over-squashing.
Authors: We concur that statistical reporting is essential to substantiate consistency claims. We have re-executed all experiments using ten independent random seeds, added standard-deviation error bars to every table and figure in §4, and included paired t-test p-values comparing C3E-derived configurations against baselines. The revised results confirm that the observed improvements remain statistically significant (p < 0.05) across the nine datasets, thereby reinforcing the assertion that C3E mitigates over-squashing. revision: yes
-
Referee: [§3.2] §3.2 (capacity model): the formulation appears to treat each layer as an additive or linear channel whose mutual information scales directly with dimension and depth, yet the paper itself describes over-squashing as arising from nonlinear activations and repeated Laplacian applications; the mapping from empirical compression in representation matrices to the capacity objective is therefore not secured.
Authors: We thank the referee for this precise critique of the modeling assumptions. Although over-squashing originates from nonlinear activations and iterated Laplacian multiplications, the C3E capacity model employs an effective information-theoretic upper bound that is calibrated against the empirically observed compression in the final representation matrices. In the revision we have augmented §3.2 with a short derivation showing how the spectral capacity expression bounds the end-to-end mutual information even in the presence of nonlinearities, together with additional plots that correlate the capacity objective directly with measured compression metrics. We believe these clarifications secure the mapping while preserving the abstraction’s utility. revision: partial
Circularity Check
No significant circularity: derivation grounded in external information-theoretic model
full rationale
The paper formulates hidden dimension and depth selection as a nonlinear program that maximizes channel capacity under a communication-channel model of spectral GNN propagation. This objective is defined from graph structure, propagation mechanism, and information-theoretic capacity expressions rather than from the downstream task loss or representation quality metrics used in evaluation. No equations reduce the chosen parameters to fitted values of the target performance; the optimization is solved once per graph and then applied to GNN training. No self-citation chains, self-definitional loops, or renaming of known empirical patterns appear in the derivation. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- channel capacity threshold or scaling constants
axioms (1)
- domain assumption Spectral graph neural networks can be modeled as communication channels whose capacity is a direct function of hidden dimension, propagation depth, propagation mechanism, and graph structure.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 1 … ϕ = max [½ ln(2πe) + ½ Σ ln(n w_{l-1} σ²_{S_l}) ] … effective channel capacity ϕ₀ … nonlinear programming problem … ln(n) ≤ ϕ₀ ≤ (1/η) ln(n)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
representation compression ratio θ = ϕ / w … Corollary 2 … threshold w* = e^{1-K} … dual dependency on w and L
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Achille, A.; and Soatto, S. 2018. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50): 1--34
work page 2018
- [2]
-
[3]
Bodnar, C.; Di Giovanni, F.; Chamberlain, B.; Lio, P.; and Bronstein, M. 2022. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in GNN s. Adv. Neural Information Processing Systems, 35: 18527--18541
work page 2022
-
[4]
Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and deep locally connected networks on graphs. In Int. Conf. on Learning Representations (ICLR)
work page 2014
- [5]
-
[6]
Cai, S.; Li, L.; Deng, J.; Zhang, B.; Zha, Z.-J.; Su, L.; and Huang, Q. 2021. Rethinking graph neural architecture search from message-passing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6657--6666
work page 2021
-
[7]
Chan, K. H. R.; Yu, Y.; You, C.; Qi, H.; Wright, J.; and Ma, Y. 2022. Redunet: A white-box deep network from the principle of maximizing rate reduction. Journal of Machine Learning Research, 23(114): 1--103
work page 2022
-
[8]
Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; and Sun, X. 2020. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In AAAI Conference on Artificial Intelligence, volume 34, 3438--3445
work page 2020
- [9]
-
[10]
Cong, W.; Ramezani, M.; and Mahdavi, M. 2021. On provable benefits of depth in training graph convolutional networks. Adv. Neural Information Processing Systems, 34: 9936--9949
work page 2021
-
[11]
Cover, T. M. 1999. Elements of information theory. John Wiley & Sons
work page 1999
-
[12]
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Adv. Neural Information Processing Systems, 29
work page 2016
-
[13]
Dehmer, M.; and Mowshowitz, A. 2011. A history of graph entropy measures. Information Sciences, 181(1): 57--78
work page 2011
-
[14]
Di Giovanni, F.; Giusti, L.; Barbero, F.; Luise, G.; Lio, P.; and Bronstein, M. M. 2023. On over-squashing in message passing neural networks: The impact of width, depth, and topology. In Int. Conf. on Machine Learning (ICML), 7865--7885
work page 2023
-
[15]
Dwivedi, V. P.; Joshi, C. K.; Luu, A. T.; Laurent, T.; Bengio, Y.; and Bresson, X. 2023. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43): 1--48
work page 2023
-
[16]
Fuzhen, Z. 2005. The S chur complement and its applications. Series: Numerical Methods and Algorithms, 4
work page 2005
-
[17]
Gallager, R. G. 1968. Information theory and reliable communication, volume 588. Springer
work page 1968
-
[18]
Gasteiger, J.; Bojchevski, A.; and G \"u nnemann, S. 2018. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In Int. Conf. on Learning Representations (ICLR)
work page 2018
-
[19]
Gasteiger, J.; Wei enberger, S.; and G \"u nnemann, S. 2019. Diffusion improves graph learning. Adv. Neural Information Processing Systems, 32
work page 2019
-
[20]
Glorot, X.; and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249--256. JMLR Workshop and Conference Proceedings
work page 2010
-
[21]
Gravina, A.; Eliasof, M.; Gallicchio, C.; Bacciu, D.; and Sch \"o nlieb, C.-B. 2025. On oversquashing in graph neural networks through the lens of dynamical systems. In AAAI Conference on Artificial Intelligence, volume 39(16), 16906--16914
work page 2025
-
[22]
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Adv. Neural Information Processing Systems, 30
work page 2017
-
[23]
He, M.; Wei, Z.; and Wen, J.-R. 2022. Convolutional neural networks on graphs with chebyshev approximation, revisited. Adv. Neural Information Processing Systems, 35: 7264--7276
work page 2022
-
[24]
He, M.; Wei, Z.; Xu, H.; et al. 2021. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Adv. Neural Information Processing Systems, 34: 14239--14251
work page 2021
-
[25]
Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[26]
Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020. Open graph benchmark: Datasets for machine learning on graphs. Adv. Neural Information Processing Systems, 33: 22118--22133
work page 2020
-
[27]
Huang, K.; Wang, Y. G.; Li, M.; et al. 2024. How Universal Polynomial Bases Enhance Spectral Graph Neural Networks: Heterophily, Over-smoothing, and Over-squashing. arXiv preprint arXiv:2405.12474
-
[28]
Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Information Processing Systems, 31
work page 2018
-
[29]
Jaynes, E. 2003. Probability Theory: The Logic of Science, volume 727. Cambridge University Press
work page 2003
-
[30]
Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical review, 106(4): 620
work page 1957
-
[31]
Karhadkar, K.; Banerjee, P. K.; and Mont \'u far, G. 2022. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. arXiv preprint arXiv:2210.11790
-
[32]
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Int. Conf. on Learning Representations (ICLR) (ICLR)
work page 2017
-
[33]
Kraft, D. 1988. A software package for sequential quadratic programming. Forschungsbericht- Deutsche Forschungs- und Versuchsanstalt fur Luft- und Raumfahrt
work page 1988
-
[34]
Kullback, S. 1997. Information theory and statistics. Courier Corporation
work page 1997
-
[35]
Lee, J.; Lee, I.; and Kang, J. 2019. Self-attention graph pooling. In Int. Conf. on Machine Learning (ICML), 3734--3743
work page 2019
-
[36]
Loukas, A. 2020 a . How hard is to distinguish graphs with graph neural networks? Adv. Neural Information Processing Systems, 33: 3465--3476
work page 2020
-
[37]
Loukas, A. 2020 b . What graph neural networks cannot learn: depth vs width. In Int. Conf. on Learning Representations (ICLR)
work page 2020
-
[38]
Maskey, S.; Paolino, R.; Bacho, A.; and Kutyniok, G. 2024. A fractional graph L aplacian approach to oversmoothing. Adv. Neural Information Processing Systems, 36
work page 2024
-
[39]
Nt, H.; and Maehara, T. 2019. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[40]
H.; Chen, D.; and Borgwardt, K
Pellizzoni, P.; Schulz, T. H.; Chen, D.; and Borgwardt, K. 2024. On the expressivity and sample complexity of node-individualized graph neural networks. Adv. Neural Information Processing Systems, 37: 120221--120251
work page 2024
-
[41]
Ramp \'a s ek, L.; Galkin, M.; Dwivedi, V. P.; Luu, A. T.; Wolf, G.; and Beaini, D. 2022. Recipe for a general, powerful, scalable graph transformer. Adv. Neural Information Processing Systems, 35: 14501--14515
work page 2022
-
[42]
Roberts, D. A.; Yaida, S.; and Hanin, B. 2022. The principles of deep learning theory, volume 46. Cambridge University Press Cambridge, MA, USA
work page 2022
-
[43]
Rozemberczki, B.; Allen, C.; and Sarkar, R. 2021. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2): cnab014
work page 2021
-
[44]
M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B
Saxe, A. M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B. D.; and Cox, D. D. 2019. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): 124020
work page 2019
-
[45]
Schoenholz, S. S.; Gilmer, J.; Ganguli, S.; and Sohl-Dickstein, J. 2016. Deep information propagation. arXiv preprint arXiv:1611.01232
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[46]
Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine, 29(3): 93--93
work page 2008
-
[47]
Shannon, C. E. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3): 379--423
work page 1948
-
[48]
Shchur, O.; Mumme, M.; Bojchevski, A.; and G \"u nnemann, S. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[49]
Shen, X.; Wang, Y.; Lin, M.; Huang, Y.; Tang, H.; Sun, X.; and Wang, Y. 2023. Deepmad: Mathematical architecture design for deep convolutional neural network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6163--6173
work page 2023
- [50]
-
[51]
Tipping, M. E.; and Bishop, C. M. 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3): 611--622
work page 1999
-
[52]
The information bottleneck method
Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057
work page internal anchor Pith review Pith/arXiv arXiv 2000
-
[53]
Topping, J.; Di Giovanni, F.; Chamberlain, B. P.; Dong, X.; and Bronstein, M. M. 2021. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522
-
[54]
Veli c kovi \'c , P.; Cucurull, G.; Casanova, A.; Romero, A.; Li \`o , P.; and Bengio, Y. 2018. Graph Attention Networks. In Int. Conf. on Learning Representations (ICLR)
work page 2018
-
[55]
Von Neumann, J. 2018. Mathematical foundations of quantum mechanics: New edition, volume 53. Princeton University Press
work page 2018
-
[56]
Wang, X.; and Zhang, M. 2022. How powerful are spectral graph neural networks. In Int. Conf. on Machine Learning (ICML), 23341--23362
work page 2022
-
[57]
Wang, Z.; and Wu, L. 2023. Theoretical analysis of the inductive biases in deep convolutional networks. Adv. Neural Information Processing Systems, 36: 74289--74338
work page 2023
-
[58]
Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In Int. Conf. on Machine Learning (ICML), 6861--6871
work page 2019
-
[59]
Wu, T.; Ren, H.; Li, P.; and Leskovec, J. 2020. Graph information bottleneck. Adv. Neural Information Processing Systems, 33: 20437--20448
work page 2020
-
[60]
Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018 a . How powerful are graph neural networks? arXiv preprint arXiv:1810.00826
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[61]
Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018 b . Representation learning on graphs with jumping knowledge networks. In Int. Conf. on Machine Learning (ICML), 5453--5462
work page 2018
- [62]
-
[63]
Z.; Peng, H.; Li, A.; Xue, S.; and Su, J
Yang, Z.; Zhang, G.; Wu, J.; Yang, J.; Sheng, Q. Z.; Peng, H.; Li, A.; Xue, S.; and Su, J. 2023. Minimum entropy principle guided graph neural networks. In ACM International Conference on Web Search and Data Mining, 114--122
work page 2023
-
[64]
You, J.; Ying, Z.; and Leskovec, J. 2020. Design space for graph neural networks. Adv. Neural Information Processing Systems, 33: 17009--17021
work page 2020
-
[65]
S.; Hooi, B.; Xu, H.; and Feng, J
Zhou, K.; Dong, Y.; Wang, K.; Lee, W. S.; Hooi, B.; Xu, H.; and Feng, J. 2021. Understanding and resolving performance degradation in deep graph convolutional networks. In 30th ACM International Conference on Information & Knowledge Management, 2728--2737
work page 2021
-
[66]
Zhu, H.; and Koniusz, P. 2021. Simple spectral graph convolution. In Int. Conf. on Learning Representations (ICLR)
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.